The logical structure of experiments lays the foundation for a theory of reproducibility

The scientific reform movement has proposed openness as a potential remedy to the putative reproducibility or replication crisis. However, the conceptual relationship among openness, replication experiments and results reproducibility has been obscure. We analyse the logical structure of experiments, define the mathematical notion of idealized experiment and use this notion to advance a theory of reproducibility. Idealized experiments clearly delineate the concepts of replication and results reproducibility, and capture key differences with precision, allowing us to study the relationship among them. We show how results reproducibility varies as a function of the elements of an idealized experiment, the true data-generating mechanism, and the closeness of the replication experiment to an original experiment. We clarify how openness of experiments is related to designing informative replication experiments and to obtaining reproducible results. With formal backing and evidence, we argue that the current ‘crisis’ reflects inadequate attention to a theoretical understanding of results reproducibility.


Introduction
In a number of scientific fields, replication and reproducibility crisis labels have been used to refer to instances where many results have failed to be corroborated by a sequence of scientific experiments. This state of affairs has led to a scientific reform movement. However, this labelling is ambiguous between a crisis of practice and a crisis of conceptual understanding. Insufficient attention has been given to the latter, which we believe is a detriment to moving forward to conduct science better. In this article, we make theoretical progress towards understanding replications and reproducibility of results (henceforth, 'results reproducibility')¹ by a formal examination of the logical structure of experiments.² We view replication and reproducibility as methodological subjects of metascience. As we have emphasized elsewhere [10], these methodological subjects need a formal approach to properly study them. Therefore, our work here is necessarily mathematical; however, we make our conclusions relatable to the broader scientific community by pursuing a narrative form in explaining our framework and results within the main text. Mathematical arguments are presented in the appendices. Our objective is to build a strong, internally consistent, verifiable theoretical foundation to understand and to develop a precise language to talk about replication experiments and results reproducibility as well as to use this framework to study how openness in and of experiments is related to either of them. We advance mathematical arguments from first principles and proofs, using probability theory, mathematical statistics, statistical thought experiments and computer simulations. We ask the reader to evaluate our work within its intended scope of providing theoretical precision and nuanced arguments.
© 2023 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
Our research is motivated by the following backdrop: a common concern voiced in the scientific reform literature and recent scholarly discourse regards various forms of scientific malpractice as potential culprits of reproducibility failures, and openness is sometimes touted as a remedy to alleviate such malpractices [11–15]. Some malpractice is believed to take place at the level of the scientist. For example, hypothesizing after the results are known involves presenting a post hoc hypothesis as if it were an a priori hypothesis, conditional on observing the data [16,17]. Another example is p-hacking, a statistically invalid form of performing inference to find statistically significant results [17–19]. Some malpractice is believed to operate at the community or institution level. For example, publication bias involves omitting studies with statistically non-significant results from publications and is primarily attributed to flawed incentive structures in scientific publishing [1,17]. Transparency in scientific practice in general and tools to promote openness in experimental reporting (such as preregistration, registered reports, and open laboratory books) in particular are often highlighted as potential remedies to curb such malpractice. Before we suspect malpractice of either kind and set out to correct the scientific record or demand reparations, however, it behoves the scientific community to gain a complete understanding of the factors that may account for a given set of results in a sequence of replication experiments. This way we can hope to understand what aspects of experiments need to be openly communicated and to what end.
If a result of an experiment is not reproduced by a replication experiment, before we reject it as a false positive or suspect some form of malpractice, we need to assess and account for: (i) sampling error, (ii) theoretical constraints on the reproducibility rate of the result of interest, conditional on the elements of the original experiment, and (iii) assumptions from the original experiment that were not carried over to the replication experiment. The first of these is a well-known and widely understood statistical fact that describes why methodologically we can at best guarantee reproducibility of a result on average (i.e. in expectation). The second point about the theoretical limits of the reproducibility rate is not well understood, and we hope to address this oversight in this article. The last one has been brought up in individual cases but typically in an ad hoc manner, and we aim to provide a systematic approach for comprehensive evaluations of replication experiments. Since metascientific heuristics may lead us astray in these assessments, we need a fine-grained conceptual understanding of how experiments operate and relate to each other, and what role openness plays in facilitating replications or promoting reproducible results. Indeed, a replication crisis and a reproducibility crisis are different things and should be understood on their own. We distinguish between replication experiments and results reproducibility, discuss precursors of each, and assess how openness of experiments relates to each separately.
¹ We focus on the end products of experiments and the results and not other components of experiments that bring those products about. This choice is fitting given the etymology of the term 'reproduce' in the sense of producing a given result, and by its formal association with statistical theory. In our research, we have aimed to use results reproducibility or reproducibility of results consistently. This usage is not idiosyncratic or esoteric. Reproducibility has been defined in a similar (if less technical) fashion by other scholars [1–4]. Unfortunately, there is variation in usage of these terms in the metascientific literature [5,6]. For example, replicability may refer to what we call results reproducibility (e.g. [7]). Further, reproducibility may convey computational reproducibility of the results given the data; i.e. obtaining the same output when a computer code is re-run with fixed input (e.g. [8]). Here, we do not refer to computational reproducibility. Our context is statistical: the reproducibility of experimental results in replication studies. We sidestep the potential confusion by laying out the definitions as we have and adhering to them for the remainder of this article.
² Some of the ideas developed in depth here appeared in a preliminary form in [9].
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 221042
In this article, we argue that 'failed' replications do not necessarily signify failures of scientific practice.³ Rather, they are expected to occur at varying rates due to the features of and differences in the elements of the logical structure of experiments. By using a mathematical characterization of this structure, we provide precise definitions of and clear delineation between replication, reproducibility and openness. Then, by using toy examples, simulations and cases from the scientific literature, we illustrate how our characterization of experiments can help identify what makes for replication experiments that can, in theory, reproduce a given result and what determines the extent to which experimental results are reproducible. In the next section, we define the main notions that we use to build a logical structure of experiments, which helps us derive our theoretical results.

The logical structure of experiments

2.1. Definitions
The idealized experiment is a probability experiment: a trial with uncertain outcome on a well-defined set. A scientific experiment where inference is desired under uncertainty can be represented as an idealized experiment. The results from an experiment can be defended as valid only if the assumptions of the probability experiment hold. One useful set-up for us is as follows: given some background knowledge K (see table 1 for reference to all notation and terms introduced in this section) on a natural phenomenon, a scientific theory makes a prediction, which is in principle testable using observables, the data D. A mechanism generating D is formulated under uncertainty and is represented as a probability model M_A under assumptions A. Given D, inference is desired on some unknown part of M_A. The extent to which parts of M_A that are relevant to the inference are confirmed by D is assessed by a fixed and known collection of methods S evaluated at D (similar descriptions for other purposes can be found in [10,21]). Definition 2.1 of ξ captures some key distinct elements of experiments whose population characteristics can in principle be tested. These elements are not necessarily independent of each other. For example, K may inform and constrain the sets of plausible M_A and S. Or it may be necessary for M_A to constrain S. M_A includes the sampling design when sampling a population conforming A, which we assume to be independent of sampling design. For example, A may be the description of an infinite population of interest, which may be sampled in a variety of ways to yield distinct probability models M_A for the data depending on the sampling scheme.
We distinguish two elements of S: S_pre and S_post. S_pre is the scientific methodological assumptions made before data collection and procedures implemented to obtain D. S_pre captures assumptions in designing and executing an experiment such as experimental paradigms, study procedures, instruments and manipulations. Conditional on K and M_A, S_pre is reliable if the random variability in D is due only to sampling variability modelled by M_A. S_post is the statistical methods applied on D. If inferential, S_post is reliable if it is statistically consistent. S is reliable if and only if S_pre and S_post are reliable.
We also distinguish two elements of D: D_s and D_v. D_s is the structural aspects of the data, such as the sample size, number of variables, units of measurement for each variable, and metadata. D_v is the observed values, that is, a realization conforming D_s. Some statistical approaches to assess risk and loss focus on the reproducibility conditional on D_v, whereas others focus on averages over independent realizations of D_v. Definition 2.1 of ξ allows us to scaffold other definitions as follows. An exact replication experiment ξ′ must generate D′ independent of D conditional on M_A in the values but with the same structure D_s. Definition 2.2 mathematically isolates ξ and ξ′ from R, the result of interest as formally defined in definition 2.3. That is, ξ′ does not need to have a specific aim to be performed or worked with as a mathematical object. The benefits of this isolation will become clear in §3, where an unconditional ξ and its non-exact ξ′ pair may become a ξ and its exact ξ′ pair, conditional on R.
³ We are not the first to take issue with the 'replication crisis' framing. We invite the interested reader to visit Feest's [20] provocative and incisive assessment of why replication is overrated.
Often, however, we would perform experiments with a specific aim and would like to see whether the result of ξ is reproduced in ξ′. Depending on the desired mode of statistical inference, example aims include hypothesis testing, point or interval estimation, model selection or prediction of an observable. Further, when augmented with an R, K′ must differ from K in a specific way. Encompassing all these statistical modes of inference, we introduce the notion of a result R, as a decision rule. For convenience, we assume that R lives on a discrete space here.
Definition 2.3. Let $\mathcal{X}$ be the sample space and $\mathcal{R} \equiv \{r_1, r_2, \ldots, r_q\}$, $q \in \mathbb{Z}^+$, be the decision space. For sample size $n \in \mathbb{Z}^+$, the function $R : \mathcal{X}^n \to \mathcal{R}$ is a result.
R is obtained by mapping the application of S_post on D on to the decision space. If ξ′ is aimed at reproducing R of ξ, it is conditional on R and leads us to the following connection between an idealized experiment and a result.
In definition 2.4, reproducibility of R depends on the available actions r_1, r_2, …, r_q. The size of q is case specific. Examples are as follows. In a null hypothesis significance test, q = 2: the null hypothesis and the alternative hypothesis. In a model selection problem, we entertain q models and choose one as the best model generating the data. In a parameter estimation problem for a continuous parameter, we build q arbitrary bins, and call a result reproduced if the estimate from ξ′ falls in the same bin as the result from ξ. How the bins are constructed in a problem affects the actual reproducibility rate of a result. However, for our purposes in this article, theoretical results hold for all cases regardless of this tangential issue.
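The binning rule for a continuous parameter can be illustrated with a minimal sketch. It assumes equal-width bins on [0, 1], which is only one of many possible constructions; the function names `make_bins`, `result` and `reproduced` are hypothetical and introduced here for illustration.

```python
from bisect import bisect_right

def make_bins(q):
    """Interior cut points of q equal-width bins on [0, 1] (an assumed construction)."""
    return [k / q for k in range(1, q)]

def result(estimate, cuts):
    """Decision rule: map a point estimate to the index of the bin that holds it."""
    return bisect_right(cuts, estimate)

def reproduced(original_estimate, replication_estimate, cuts):
    """A result counts as reproduced when both estimates fall in the same bin."""
    return result(original_estimate, cuts) == result(replication_estimate, cuts)

cuts = make_bins(4)  # four bins with cut points 0.25, 0.5 and 0.75
print(reproduced(0.31, 0.44, cuts))  # both estimates lie in the second bin
print(reproduced(0.31, 0.52, cuts))  # the estimates lie in different bins
```

Changing `q` or the cut points changes which pairs of estimates count as reproduced, which is one way to see why the bin construction affects the actual reproducibility rate.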
The class of problems of interest to us here involves cases where, in a sequence of exact replication experiments, if S is reliable, we should expect a regularity in the results. That is, probability theory tells us that if the elements of an idealized experiment are well defined, then we should expect the results from a sequence of replication experiments to stabilize at a certain proportion, given the characteristics of an idealized experiment and the true data-generating mechanism. This notion is formalized in definition 2.5.
Definition 2.5. Let ξ^(1), ξ^(2), …, ξ^(N) be a sequence of idealized experiments. The reproducibility rate of a result, R = r_o is a parameter of the sequence (I_{C} = 1 if C, and 0 otherwise).
An advantage of definition 2.5 is that conditional on R = r_o in ξ and a sequence of replication experiments ξ^(1), ξ^(2), …, ξ^(N), the relative frequency of reproduced results ϕ_N converges to ϕ ∈ [0, 1] as N → ∞. So, we immediately have $\phi_N = N^{-1} \sum_{i=1}^{N} I_{\{R^{(i)} = r_o\}}$ as a natural estimator of ϕ. Further, we are formally comforted to know that $\lim_{N \to \infty} P(\phi_N = \phi) = 1$. That is, with high probability, the estimated reproducibility rate ϕ_N from a sequence of replication experiments will get closer to the true reproducibility rate of the original experiment ϕ.
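A small simulation can make the estimator concrete. The sketch below is an illustration under stated assumptions, not the simulation code behind our results: it assumes a binomial sampling model with true proportion 0.3, sample size 100, and a result defined as the index of the equal-width bin (q = 10) that holds the sample proportion. All function names are hypothetical.

```python
import random

def black_count(n, p, rng):
    """One binomial experiment: number of black ravens among n sampled."""
    return sum(rng.random() < p for _ in range(n))

def bin_index(count, n, q):
    """Decision rule R: index of the equal-width bin of [0, 1] holding count/n.
    Integer arithmetic avoids floating-point rounding at bin edges."""
    return min(count * q // n, q - 1)

def reproducibility_rate(r_o, n, p, q, N, rng):
    """Sample reproducibility rate: share of N exact replications with result r_o."""
    hits = sum(bin_index(black_count(n, p, rng), n, q) == r_o for _ in range(N))
    return hits / N

rng = random.Random(1)            # fixed seed so that the run is repeatable
n, p_true, q = 100, 0.3, 10       # assumed values for illustration only
r_o = bin_index(black_count(n, p_true, rng), n, q)  # result of the original experiment
for N in (10, 100, 1000, 10000):
    print(N, reproducibility_rate(r_o, n, p_true, q, N, rng))
```

As N grows, the printed proportions should fluctuate less from one run to the next, which is the behaviour the estimator is expected to show under exact replications.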
Finally, we turn to the last of our key concepts: openness. Openness refers to the accessibility of all necessary information regarding the elements of ξ by another idealized experiment ξ*. This accessibility may be used for a variety of purposes. For example, S_post can be re-applied to D to verify R independently of ξ. In this capacity, openness facilitates the auditing of experimental results by way of screening off certain errors, including human and instrumental (e.g. data entry and programming errors), that may be introduced in the process of obtaining R initially. On the other hand, openness may be needed to perform an exact ξ′ by way of duplicating S_pre to obtain D′ and S_post to obtain R′. In this capacity, openness makes exact ξ′ possible.
Openness is critically related to reproducibility since the degree to which information is transferred from ξ to ξ′ impacts the ϕ of a given result. However, not all elements of ξ need to be open for all purposes. Therefore, a nuanced understanding of openness requires evaluating it at a fixed configuration of the elements of ξ conditional on a specific purpose, rather than as a categorical judgement at the level of the whole experiment, as open or not. This leads us to think of openness element-wise, as in definition 2.6.
Definition 2.6. Let $\mathcal{P}$ be the power set of elements of ξ and $\pi \in \mathcal{P}$. ξ is π-Open for ξ* if $\pi \subset K^*$, where ξ* is an idealized experiment that imports information from ξ.
A specific example of π-Open of definition 2.6 would be π ≡ (M_A, S_pre), where ξ* gets all the information about the assumptions, model and pre-data methods from ξ but no other information. Another example of π-Open is the special case where ξ has all its elements open, such that π ≡ (K, M_A, S, D). In this case, for convenience, we say ξ is ξ-Open for ξ*.
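Element-wise openness can be sketched as a subset check. The sketch assumes that the elements of ξ are represented by text labels and that K* is represented by the set of labels whose information ξ* has imported. Both representations, the split of S and D into their two parts, and the name `is_pi_open` are simplifications introduced here for illustration.

```python
# Assumed labels for the elements of an idealized experiment.
ELEMENTS = frozenset({"K", "M_A", "S_pre", "S_post", "D_s", "D_v"})

def is_pi_open(pi, k_star):
    """Return True when every element named in pi is contained in k_star,
    the set of element labels whose information xi* has imported."""
    pi = frozenset(pi)
    if not pi <= ELEMENTS:
        raise ValueError(f"unknown elements: {sorted(pi - ELEMENTS)}")
    return pi <= frozenset(k_star)

k_star = {"M_A", "S_pre"}                   # xi* imported the model and pre-data methods only
print(is_pi_open({"M_A", "S_pre"}, k_star))  # openness with respect to (M_A, S_pre)
print(is_pi_open(ELEMENTS, k_star))          # openness with respect to all elements
```

The check uses subset-or-equal, so an experiment whose open elements coincide exactly with what ξ* imported also counts as π-Open in this sketch.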

Fundamental results on replications and reproducibility rate from first principles
Here, we present two results about reproducibility and some remarks, based on definitions 2.1-2.6. A well-formed theory of reproducibility requires results of these types: fundamental, mathematical and invoking a functional framework to study replications and reproducibility. They serve as theoretical benchmarks to check other results against. Technically oriented readers may refer to appendices A and B for a more detailed discussion and results complementary to the main argument.
We begin by noting that, given definition 2.5 and the discussion following it, it is not straightforward to say exactly what we gain if we were to update the estimated reproducibility rate based on the results obtained from performing more replications. Indeed, to understand the value of replication experiments in assessing the reproducibility of a result, a strong mathematical statement is required, which is our result 2.7.
Result 2.7. Let ξ^(1), ξ^(2), …, ξ^(N) be a sequence of replication experiments with reproducibility rate ϕ given by definition 2.5. Then, where ϕ_N is the sample reproducibility rate of result R = r_o obtained from the sequence (proof in appendix A).
Result 2.7 is fundamental to study replications and reproducibility for a number of reasons:
1. It provides a basis for building trust in the notion of reproducibility from replication experiments. Roughly, it says that if we perform replication experiments and estimate the reproducibility rate of r_o by ϕ_N from these experiments, then we are guaranteed that deviations of ϕ_N from ϕ are going to get small and stay small.
2. It is almost necessary to move forward theoretically. It immediately implies that if the assumptions of an original experiment are satisfied in its replication experiments, then we are adopting a statistically defensible strategy by continuing to perform replication experiments and updating ϕ_N as a proportion of successes to assess the reproducibility rate. Therefore, result 2.7 gives us a theoretical justification of why we should care about performing more replication experiments whose assumptions are satisfied and be interested in estimating the reproducibility rate based on those replication experiments alone. Further, violating the assumptions of ξ in replication experiments implies that ϕ_N converges to some ϕ defined by the flaws underlying a non-exact sequence of replications of ξ rather than the reproducibility rate of r_o of interest.
3. As we will detail in result 2.9, a theoretically fertile way to study replication experiments is by defining a sequence of experiments as a stochastic process. The results from such processes almost always require the solid foundation provided by result 2.7.
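The phrase 'get small and stay small' can be read as the difference between two standard modes of convergence. The display below is a sketch of that reading under our own assumptions, not a quotation of result 2.7; the tolerance ε is introduced here only for illustration.

```latex
% Weak form: at any single large N, a deviation larger than epsilon is unlikely.
\lim_{N \to \infty} P\left( \left| \phi_N - \phi \right| > \varepsilon \right) = 0
\quad \text{for every } \varepsilon > 0 .

% Strong form: along almost every sequence of replications,
% the deviations eventually become small and remain small.
P\left( \lim_{N \to \infty} \phi_N = \phi \right) = 1 .
```

Under exact replications, the indicators that a replication reproduces r_o are independent and identically distributed with mean ϕ, so the strong law of large numbers delivers the strong form for the sample proportion ϕ_N.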
Remark 2.8. The reproducibility rate given in definition 2.5 has excellent properties as shown by result 2.7. However, we keep in mind that definition 2.5 is only one way to measure reproducibility. It is a counting measure which counts the reproduced results. Instead, a continuous measure as a degree of confirmation of a result might seem more proper to measure reproducibility. One has to be aware that just defining a reproducibility measure does not imply that it has desirable mathematical properties. It is easy to define meaningful continuous measures of reproducibility which might have pathological properties (e.g. that do not satisfy result 2.7), and these should be avoided (see appendix A for details).
In practice, S_post are functions of sample moments, such as the sample mean. In these cases, sometimes the Lindeberg-Lévy central limit theorem (CLT) and its extensions provide useful results about the properties of ξ^(1), ξ^(2), …. However, restricting S_post this way constrains the mathematical setting to study the statistical properties of ξ^(1), ξ^(2), … or results reproducibility. For example, working with the CLT is challenging when S_post cannot be formulated as a function of a fixed sample size, or when we wish to discuss the properties of a sequence of replication experiments directly, without referring to S_post as a means to estimate a particular R.
We provide a broad setting without these limitations by assuming that K requires only minimal validity conditions on M_A and S. Specifically, we let M_A be any probability model, subject only to some mathematical regularity conditions such as continuity of distribution functions, the existence of the mean and the variance of the variable of interest. We also let S_post be the sample distribution function.⁴ With the generality provided by these assumptions, we obtain one of our main theoretical results.
Result 2.9. The sequence of idealized experiments ξ^(1), ξ^(2), … given by definition 2.5 is a proper stochastic process, seen as a joint function of random sample D and of each value in the support of data-generating mechanism, $x \in \mathbb{R}$ (see constructive proof in appendix B).
Result 2.9 is of fundamental importance to study results reproducibility mathematically because it allows us to apply the well-developed theory of stochastic processes to build a theory of results reproducibility. Two aspects of result 2.9 are noteworthy:
1. When we obtain a random sample in ξ and perform inference using a fixed value of a statistic such as a threshold, the sequence ξ^(1), ξ^(2), … constitutes random variables independent of each other conditional on the true model generating the data. Obtaining the distributions implied by ξ helps us understand the statistical nature of replication experiments.
2. ξ′ generates new data D′, and R′ is conditional on D′. That is, when inference is performed for a particular replication experiment, the data are fixed. Most generally, conditional on D′, if the empirical distribution function is R′, then the replication experiment estimates the model generating the data. Therefore, a replication experiment determines a sample-based estimate of a statistical model.
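The view of a replication as a sample-based estimate of the model can be sketched with the empirical distribution function. The sketch assumes a standard normal data-generating mechanism and a sample size of 50, both chosen only for illustration; `ecdf` is a hypothetical helper and not part of the constructive proof in appendix B.

```python
import random
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical distribution function of a sample as a callable."""
    xs = sorted(sample)
    n = len(xs)

    def f_hat(x):
        # Proportion of observations that are less than or equal to x.
        return bisect_right(xs, x) / n

    return f_hat

rng = random.Random(7)
# Each replication draws new data from the same (assumed) mechanism.
replications = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(5)]
grid = [-1.0, 0.0, 1.0]
for i, data in enumerate(replications, start=1):
    f_hat = ecdf(data)
    print(i, [round(f_hat(x), 2) for x in grid])
```

Each printed row is one replication evaluated on a grid of x values. Reading down a column shows the variation across random samples at a fixed x, and reading across a row shows one fixed data set as a function of x.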
In the next section, we introduce a toy example as a running case study to instantiate our theoretical results on replications, reproducibility and openness.

A toy example
Our toy example involves an inference problem regarding a population of ravens, K. An infinite population of ravens where each raven is either black or white constitutes the population assumptions, A. Each uniformly randomly sampled raven can be identified correctly as black or white, which defines the pre-data methods, S_pre. The result of interest, R, is to estimate the (unknown) population proportion of black ravens, p, or some function of it. We consider six distinct sampling scenarios, which lead to six distinct M_A, and thus six distinct idealized experiments. To avoid overly complicated mathematical notation, we denote the models by ξ_bin, ξ_negbin, ξ_hyper, ξ_poi, ξ_exp and ξ_nor. These models represent the binomial, negative binomial, hypergeometric, Poisson, exponential and normal probability distributions for the data-generating mechanism, respectively. In specific examples, we also vary S_post, the point estimator of the parameter of interest to take values as maximum likelihood estimate (MLE), method of moments estimate (MME) and posterior mode (i.e. Bayesian inference). We further vary D_s via the sample size (i.e. n ∈ {10, 30, 100, 200}). We use these idealized experiments to illustrate our results in the rest of the article.
⁴ We assume that the order in which the data values appear has no bearing on the inferential goal. The cases in which the order contains information are important for a variety of subject matters, but it is well known that the statistical techniques that deal with them are too specialized to be treated in a general set-up. An example is autoregressive models.
These six idealized experiments make the following sampling assumptions. ξ_bin stops when n ravens are sampled. ξ_negbin stops when w white ravens are sampled. ξ_hyper is a special case where the sampling has access only to a finite subset of the infinite population delineated by A. ξ_bin, ξ_negbin and ξ_hyper are often called exact models, in the sense that their M_A does not involve any limiting or approximating assumptions. On the other hand, ξ_poi approximates ξ_bin, where a large sample of n ravens is sampled when the proportion of black ravens p is small. The larger the n and the smaller the p such that np remains constant, the better the approximation. ξ_exp has the same approximative characteristics and parameter as ξ_poi. However, notably, ξ_exp records the time between observations instead of counting the ravens, so its S_pre is different from all other experiments. Finally, ξ_nor approximates ξ_bin when a large sample of n ravens is drawn and the proportion of black ravens, p, is intermediate.
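The quality of the Poisson approximation to the binomial can be checked numerically. The sketch below holds np fixed at 3, a value assumed only for illustration, and reports the largest pointwise gap between the two probability mass functions as n grows. The helper `max_gap` and its cut-off `k_max` are hypothetical.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Probability of k black ravens among n under the binomial model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of k black ravens under the Poisson model with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

def max_gap(n, lam, k_max=30):
    """Largest absolute difference between the two pmfs over k = 0..min(n, k_max)."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(min(n, k_max) + 1))

lam = 3.0  # np is held constant while n grows and p shrinks
for n in (10, 100, 1000):
    print(n, lam / n, max_gap(n, lam))
```

The printed gap should shrink as n increases and p decreases, in line with the statement that the approximation improves when np remains constant.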
As the result of interest, R, these six idealized experiments aim to estimate either the proportion of black ravens, p, in the population or the rate of black ravens sampled, np → λ, a function of p, in the approximative models. Figure 1 shows distinctive elements of these six idealized experiments.
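Two of the exact sampling designs can be sketched side by side. The code assumes a true proportion of 0.3 and fixed stopping values, all chosen for illustration; the function names are hypothetical. Both designs lead to a likelihood proportional to p^black (1 − p)^white, so the MLE of p takes the same form, black/(black + white), although the stopping rules differ.

```python
import random

def sample_bin(n, p, rng):
    """Binomial design: stop once n ravens are sampled. Returns (black, white)."""
    black = sum(rng.random() < p for _ in range(n))
    return black, n - black

def sample_negbin(w, p, rng):
    """Negative binomial design: stop once w white ravens are sampled (needs p < 1)."""
    black = white = 0
    while white < w:
        if rng.random() < p:
            black += 1
        else:
            white += 1
    return black, white

def mle_p(black, white):
    """MLE of the proportion of black ravens under either design."""
    return black / (black + white)

rng = random.Random(11)
p_true = 0.3  # assumed value for illustration
print("bin   ", mle_p(*sample_bin(100, p_true, rng)))
print("negbin", mle_p(*sample_negbin(70, p_true, rng)))
```

The total sample size is fixed in the first design and random in the second, which is one concrete way in which the two probability models differ for the same population assumptions.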
In §4, we use these six idealized experiments to show that openness connects to reproducibility in a variety of ways and that, to reproduce a given result, replication experiments do not need to be exact. We show that conditional on a given result from an original experiment, non-exact replication experiments can serve as valid exact replication experiments, if the inferential equivalence holds between the original and the replication. We further show that the true rate of reproducibility of a sequence of exact replication experiments and a sequence of non-exact replication experiments are distinct (except trivially) for a given result.

Element-wise openness and assessing the meaning of replications
Tools and procedures have been developed to help facilitate openness in science [11,14,17,22]. Guidelines may argue for making as much information available as possible about an experiment or leave it to intuition to guide which elements of an experiment are relevant and need to be shared for replication. We are interested in better understanding what does and does not need to be made available, in service of which objective, and under what conditions. We perceive two main issues: what openness means for performing meaningful replications and how it impacts results reproducibility. We first evaluate the former. Then we show that a uniform, wholesale framing of openness is not the remedy to the reproducibility crisis that some take it to be.
ξ has elements involving uncertainty, such as D_v taken as a random variable. Uncertainty modelled by probability is always conditional on the available background information [23], and thus,
Figure 1. Six idealized experiments ξ_bin, ξ_negbin, ξ_hyper, ξ_poi, ξ_exp, ξ_nor: the binomial, negative binomial, hypergeometric, Poisson approximation to binomial, exponential waiting times between Poisson events and normal approximation to binomial, respectively. All but ξ_hyper assume infinite population (A) of black and white ravens, with sampling designs resulting in distinct probability models (M_A). ξ_hyper assumes sampling from a finite subset of the population. All experiments aim at performing inference on result (R), which reduces down to an estimate of either the population proportion of black ravens or the mean number of black ravens in the population.
We use definition 2.6 and specify π to assess the degree of openness in these experiments. When ξ is ξ-Open, the probability of exact replication is 1, and every node of the network is only connected to itself. If ξ is π-Open, where π is a proper subset of ξ, then ξ′ may be a non-exact replication of ξ in various ways because ξ′ needs to substitute in a value for elements that are not in π. Therefore, the probability of ξ′ being an exact replication of ξ is lower than when ξ is ξ-Open. In figure 2, we show the network structures that result from choosing non-open elements with equal probability among all substitutions considered for each element. The network complexity depends on the size of π. If it is large, the number of connections among the nodes in the network is small, and each connection is strong (e.g. strongest when all open). In contrast, if it is small, the number of connections among the nodes in the network is large because there are both multiple substitutions to be made and multiple possibilities for each, and each connection is weak (e.g. weakest when M_A, S_post, D_s not open in figure 2). Hence, as the size of π decreases, it becomes less probable to perform an exact replication of ξ. By looking at which elements of ξ are open to start with, we can assess how the sequence ξ^(1), ξ^(2), … of replication experiments can be misinterpreted if the necessary elements were not open and/or got lost in translation. In the rest of this section, we organize our results by elements K, M_A, S, D.
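The way the probability of an exact replication shrinks with the size of π can be sketched with a short calculation. The sketch takes the numbers of substitutions from the toy example (six models, three point estimators, four sample sizes) and assumes, for illustration only, that each non-open element is filled in uniformly at random and independently of the others. The function `prob_exact` is hypothetical.

```python
from fractions import Fraction
from math import prod

# Number of substitutions considered for each element in the toy example.
CANDIDATES = {"M_A": 6, "S_post": 3, "D_s": 4}

def prob_exact(candidates, open_elements):
    """Probability that every non-open element is substituted with the original's
    value, assuming independent uniform choices (an assumption of this sketch)."""
    return prod(
        (Fraction(1, k) for name, k in candidates.items() if name not in open_elements),
        start=Fraction(1),
    )

print(prob_exact(CANDIDATES, {"M_A", "S_post", "D_s"}))  # all three elements open
print(prob_exact(CANDIDATES, {"M_A"}))                   # only the model is open
print(prob_exact(CANDIDATES, set()))                     # none of the three is open
```

Each element that drops out of π multiplies the probability by one over its number of substitutions, so the probability falls quickly as fewer elements are open.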

Background knowledge, K
Providing an exact description of what goes into K is notoriously difficult. K, which is more of a philosophical element of ξ, typically carries over much more than what can be immediately gleaned from a transparent and complete description of M_A, S and D. We understand K to contain theoretical assumptions, contextual knowledge, paradigmatic principles, a specific language and presuppositions inherent in a given field; in short, a lot of inherited cultural and historical meaning of the kind Feyerabend refers to as natural interpretations in Against Method ([24], p. 49). As Feyerabend explains, such natural interpretations are not easy to make explicit or even sometimes be aware of and thus, being open about them might not be a matter of choice. However, observations gain meaning only against this backdrop, and experiments can only be interpreted correctly by using the same language used to design them in the first place. Within ξ, this tends to happen implicitly, whereas when performing ξ′, there is no guarantee that all the relevant information in K will carry over to K′.
By using the binomial experiment in our toy example, we can illustrate why K is an integral part of ξ and what role it plays for ξ′. In ξ_bin, our aim (R) is to estimate the proportion of black ravens (p) in an infinite population of ravens (A). M_A samples n ravens. In the case of our S_pre, we count black and white ravens by naked eye. In the case of our S_post, we use the maximum likelihood estimator of p. We set n = 100, which constitutes our D_s. This description of ξ_bin based on a specific configuration of M_A, S_pre, S_post, D_s could just as well be used to define an experiment in which scientists are interested in estimating the proportion of black swans in a population of black and white swans. While ξ_bin would still be mathematically well defined, its scientific content and context are not captured by any of these four elements. For that, we need K. Without K, we would have to consider a ξ′_bin about black swans as an acceptable replication of ξ_bin about black ravens, based on the mathematical structure alone. K, then, communicates scientific meaning across experiments.
As a more practical example of the import of K, we consider a recent 'failed' replication experiment. Murre [25] attempted to replicate a classical experiment by Godden and Baddeley [26] on context-dependent memory. Context-dependent memory refers to the hypothesis that the higher the match between the context in which a memory is being retrieved and the context in which the memory was originally encoded, the more successful the recall is expected to be. In the abstract, Murre [25] summarizes the results of the replication experiment as follows: 'Contrary to the original experiment, we did not find that recall in the same context where the words had been learned was better than recall in the other context.' Does this suggest that the results of the original experiment were a false positive, as replication failures are commonly interpreted? There are many reasons not to jump to that conclusion, including sampling error and the fact that the context of the replication was different from that of the original [26] experiment. Specifically, unlike the original, the replication was being filmed as part of a TV programme. We will set these obvious concerns aside for a moment to focus on another. Ira Hyman explains the issue in a Twitter thread [27]. Hyman indicates that the phenomenon of context-dependent memory is conditional on the distinctiveness of the encoding context. That is, if distinct contexts are used over multiple trials, the chances that the context will be remembered with the encoded information increase. When the context is not distinctive enough or remains constant over trials, the effect disappears. Another known boundary condition for the phenomenon is the outcome variable: past research has shown that the effect holds for retrieval tasks (e.g. free recall) and not for recognition. The Murre [25] replication did not carry over these contextual details and changed the design in a way that would not instigate context-dependent memory.
As a result, the differences between R and R 0 become impossible to attribute to a single cause and fail to provide evidence that can confirm or refute the results of the original Godden and Baddeley [26] experiment. It is even questionable whether the Murre [25] experiment provided an appropriate test of the result of interest in the first place to be considered a meaningful or relevant replication. This replication example on context-dependent memory appears to imply that a ξ 0 is meaningful or relevant with respect to a specific result R. By definition 2.2 and its interpretation, however, we know that mathematically, it is more convenient to separate the definition of ξ 0 from R. It follows that there are at least two aspects of assessing the meaning and relevance of a replication. Firstly, while an operational definition of K is elusive, a useful way to think about K is 'all the information in ξ that is not already in M A , S and D'. At the minimum, for ξ 0 to be considered a meaningful replication of ξ, K 0 must import some information in K regarding the immediate scientific context of ξ. For this to hold, there is no need to invoke the notion of R. Second, to assess the reproducibility of a given R, K 0 must import relevant information pertaining to R from ξ. That is, replication experiments unconditional and conditional on R are not the same objects. To emphasize the difference between them, we distinguish between in-principle and epistemic reproducibility of an R in remark 4.1 (for further details, see appendix C).
Remark 4.1. Let ξ be an idealized experiment and ξ 0 be its exact replication. Conditional on R from ξ, [...]
In practice, ξ 0 can never be an exact replication of ξ in an ontological sense. ξ is a one-time event that has already happened under certain conditions and ξ 0 has to differ from ξ in some aspect. The best standard that ξ 0 can purport to achieve is to capture relevant elements of ξ in such a way that performing inference about R while adhering to A and sampling the same population is possible within an acceptable margin of error. However, every experiment is embedded in its immediate social, historical and scientific context, making it a non-trivial task for scientists to include all the relevant K when they report the experiment in an article and make explicit all the natural interpretations used to assign meaning to its results. As such, designing and conducting replication experiments cannot be reduced to a clerical implementation of reported experimental procedures. A comprehensive understanding of K is increasingly critical as ξ 0 diverges further away from ξ to be able to comprehend the nature and importance of the divergence for the interpretability of ξ 0 and for results reproducibility. For ξ 0 to serve their intended objective, information readily available from ξ 0 needs to be supplemented by a careful historical and contextual examination of the relevant literature and the broader scientific background. Otherwise, ξ 0 may differ from ξ in non-trivial ways impacting the meaning of the evidence obtained and changing the estimated reproducibility rate.

Model, M A
For ξ 0 to be able to reproduce all possible R of ξ, M A must be specified up to the unknown quantities on which inference is desired. This specification must be transmitted to ξ 0 , such that M A and M 0 A are identical for inferential purposes mapping to R. If an aspect of M A that has an inferential value mapping to R is not transmitted to ξ 0 and this inferential value is lost, then R cannot be meaningfully reproduced by R 0 . On the other hand, given an inferential objective mapping to a specific R, the aspects of M A that are irrelevant to that inferential objective need not be transmitted to ξ 0 to meaningfully reproduce R by R 0 .
Result 4.2. [...] is that there exists a one-to-one transformation between M A and M 0 A for inferential purposes mapping to R (proof and details in appendix D).
As an example of result 4.2, consider ξ bin and ξ negbin in figure 1. Conditional on the objective of estimating p, the population proportion of black ravens, any of (ξ bin , ξ bin ), (ξ bin , ξ negbin ), (ξ negbin , ξ bin ) and (ξ negbin , ξ negbin ) can be effectively considered a pair (ξ, ξ 0 ) of an idealized experiment and its (exact) replication. The reason is that the quantity of interest p is an identifiable parameter in both experiments, although M A and M 0 A are not necessarily identical. 5 In practice, when conducting a sequence of replication experiments, we would be interested in gauging the extent to which we can reproduce a specific result. Assuming that S are the same throughout all experiments, we expect the observed reproducibility rate of a sequence of experiments whose elements are chosen from ξ bin , ξ negbin to converge on the same value, capturing the information on p, in the same way. However, result 4.2 does not imply that the (true) reproducibility rate of any two sequences of experiments involving any M A and M A is not equivalent to M A . However, the binomial and the negative binomial models become equivalent with respect to a certain inferential objective that allows for reproducing a specific R, which is estimating p. To establish this compatibility, M A should be open to ξ 0 but does not need to be assumed in ξ 0 . Specifically, to set M 0 A to be the negative binomial model in ξ 0 to reproduce the estimate of p in ξ, we need to know that ξ has used the binomial model. This ensures that ξ 0 can use a model that has the same parameter p with the exact same meaning as in ξ and same population assumptions A such that the inferential equivalence holds. A model that has different population assumptions A from ξ bin and ξ negbin is ξ hyper . This difference matters for reproducing a specific R. ξ hyper samples from an arbitrary finite subset of infinite population but still uses the same parameter p as ξ bin and ξ negbin . 
The estimate of p in ξ hyper will be biased due to differences in A. Without access to the full specification of M A , this compatibility between M A and M 0 A or lack thereof cannot be established.
5 Compare this statement to definition 2.2 of an exact and non-exact replication experiment unconditional on an inferential objective.
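To make the binomial and negative binomial comparison concrete, the following sketch simulates both sampling schemes and applies the maximum likelihood estimator of p in each. It is a sketch under stated assumptions: the function names, the value p = 0.3 and the sample sizes are introduced here for illustration, and the negative binomial scheme is taken to mean sampling until a fixed number r of black ravens has been observed.

```python
import random

def binomial_estimate(p_true, n, rng):
    """Fix the number of ravens n, count black ones, estimate p by x / n."""
    x = sum(1 for _ in range(n) if rng.random() < p_true)
    return x / n

def negative_binomial_estimate(p_true, r, rng):
    """Sample until r black ravens are seen, estimate p by r / (total sampled)."""
    total, black = 0, 0
    while black < r:
        total += 1
        if rng.random() < p_true:
            black += 1
    return r / total

if __name__ == "__main__":
    rng = random.Random(7)
    # Different stopping rules, same parameter p with the same meaning.
    print("binomial:         ", round(binomial_estimate(0.3, 20_000, rng), 3))
    print("negative binomial:", round(negative_binomial_estimate(0.3, 6_000, rng), 3))
```

Both estimators target the same identifiable parameter, which is the sense in which the two models are interchangeable for this inferential objective even though the models themselves are not identical.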
This point is illustrated in many-analyst studies [28,29] in which a fixed D is independently analysed by multiple research teams who are provided with D and a research question that puts a restriction on which R would be relevant for the purposes of the project. The teams were not, however, provided with an M A , S post or full specification of K. Teams used a variety of models differing in their assumptions about the error variance and the number of covariates (M A ) to analyse D. The results differed widely with regard to reported effect sizes and hypothesis tests. So even when D was open, the lack of specification with regard to M A yielded largely inconsistent results. It is not because the same aspects of reality cannot be captured by different models but because researchers did not automatically agree on which aspects to capture in their models.
Taking stock, our ravens example is deliberately simple to help in our analysis. State-of-the-art models are often large objects. If M A is large, it might not always be clear which class of models M 0 A can be drawn from to be equivalent to M A , and finding this class might be unfeasible. Then M A needs to be both open to and duplicated by ξ 0 to be able to reproduce the results of interest. This point is particularly important to communicate to scientists who primarily engage in routine null hypothesis significance testing procedures and may not be conventionally expected to transparently report their models.
Pertaining to mathematical features of the variables of interest, S pre may capture their types or a particular scaling. For example, a variable can be assumed discrete, continuous, or both discrete and continuous for mathematical convenience. This choice determines whether we are bound by a counting measure or a Lebesgue measure. A variable can also be assumed categorical, ordinal, interval or ratio. Some variables or parameters are scaled to the interval [0, 1] on the real line, to make their interpretation natural. All of these S pre choices affect M A and the consequent S post .
Pertaining to operational features of the variables of interest, S pre may capture the method of observation and measurement instruments. In our toy example, a raven can be observed for its colour by naked eye (S pre ), but another investigator may opt for a mechanical pigment test (S 0 pre ). What considerations should be given when making substitutions for S pre ? One issue due to choices in operationalization is measurement error. Measurement error in observables, when not accounted for, might be a factor unduly exacerbating irreproducibility or inflating reproducibility [10,32,33]. Another issue arises due to arbitrary choice of experimental manipulations or conditions which might not be mathematically equivalent. For example, manipulations that are not tested for specificity may end up manipulating non-focal constructs or only weakly manipulate the focal construct (i.e. leading to small effect sizes) [34].
Even though knowing all these features is useful in understanding S pre , there is a caveat. All aspects of S pre must be fixed before realizing D v , and it is challenging to assess a priori whether ξ and ξ 0 using different S pre and S 0 pre , respectively, can be equivalent to each other. Due to these complexities and ambiguities surrounding S pre , openness of S pre seems to be the easiest way to obtain an equivalent S [...]
As an example of result 4.3, consider models ξ poi and ξ exp in figure 1. The model in ξ poi is a good approximation to the model in ξ bin if we think of sampling ravens continuously from a population where black ravens are rare. We assume np → λ, where λ is the rate of sampling the black ravens (parameter of the Poisson model), and under this assumption, we focus on inference on λ. Now, as a thought experiment, let us assume that we do not have a device to count the number of black ravens past 1. However, we have a chronometer. As a result of using the model in ξ poi , we are, as a mathematical fact, also using the model ξ exp , which measures the time between observing black ravens. Further, the two models have the same parameter, with the same interpretation. Therefore, if we were to measure the time between observing black ravens for a sample, then we can still perform inference on the rate of observing black ravens from the population. We note that ξ bin , ξ negbin , ξ hyper , ξ poi and ξ nor operate under different assumptions, but are still counting ravens and interested in the number of black ravens. In contrast, ξ exp is considerably different from these experiments. It is not counting ravens, but measuring time, which we would reasonably define as a continuous variable. While S pre in ξ exp differs considerably from all other experiments in our toy example, the exponential experiment would serve as a meaningful ξ 0 to reproduce R in any of them, at least approximately.
6 Cooper and Guest [30] and Guest and Martin [31] make a similar point for computational reproducibility. They highlight the importance of making models available, and particularly clearly reporting model specifications and implementation assumptions so as to facilitate replication.
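The relationship between counting black ravens and timing the gaps between them can be sketched with a simulated constant-rate arrival process. The function names, the rate of 2 black ravens per unit time and the observation horizon are assumptions made for illustration; the sketch only shows that both measurement schemes recover the same rate parameter.

```python
import random

def simulate_arrival_times(rate, horizon, rng):
    """Arrival times of black ravens on [0, horizon] for a constant-rate process."""
    times, t = [], rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def rate_from_counts(times, horizon):
    """Counting view: number of black ravens per unit of sampling time."""
    return len(times) / horizon

def rate_from_waiting_times(times):
    """Chronometer view: reciprocal of the mean time between black ravens."""
    gaps = [b - a for a, b in zip([0.0] + times[:-1], times)]
    return len(gaps) / sum(gaps)

if __name__ == "__main__":
    arrivals = simulate_arrival_times(rate=2.0, horizon=5_000.0, rng=random.Random(11))
    print("rate from counts:       ", round(rate_from_counts(arrivals, 5_000.0), 3))
    print("rate from waiting times:", round(rate_from_waiting_times(arrivals), 3))
```

The observable differs (a discrete count versus a continuous duration), yet the quantity being estimated is the same in both functions.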

Statistical methods, S post
Statistical methods, S post , that are designed for a specific inferential goal, R, but do not return identical values when applied to a fixed D are common. Conversely, some statistical methods return identical values for a specific inferential goal, R, and they are mathematically equivalent conditional on D, even though they operate under distinct motivating principles. We have the following result.
[...] does not need to be duplicated to establish equivalence. For example, to use MME to estimate p in ξ 0 , we need to know that ξ has used MLE or MME. This way, we can ensure that ξ 0 will at least use an estimator that is numerically equivalent to the one used in ξ, even if not equivalent in principle. On the other hand, it is well known that a variety of S post for the same mode of inference may yield different R. The many-analyst project by Silberzahn et al. [29] provides clear examples of this. Teams that were given a fixed D to analyse for a predetermined R [...]
As an example of result 4.5, we consider the models in ξ poi and ξ exp in figure 1. The Poisson model counts the black ravens as observable. It assumes that black ravens are observed with a constant rate. The exponential model measures the time between arrivals of black ravens. It also assumes that black ravens are observed with a constant rate. By referring to the unit of observations, we see that the data structures in ξ poi and ξ exp are distinct. And yet, the unknown parameter about which inference is desired is the same: λ, the rate of black ravens appearing in continuous sampling (appendix F).
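The numerical equivalence of MLE and MME for the binomial proportion can be checked directly. In the sketch below the MLE is found by a brute-force grid search over the log-likelihood, purely so that it is computed by a different route from the closed-form MME; the function names, the grid resolution and the data (37 black ravens out of 100) are assumptions for illustration.

```python
import math

def binomial_log_likelihood(p, x, n):
    """Log-likelihood of p for x black ravens out of n (constant term omitted)."""
    return x * math.log(p) + (n - x) * math.log(1.0 - p)

def mle_by_grid_search(x, n, grid_size=10_000):
    """Numerical MLE: maximise the log-likelihood over an interior grid of p values."""
    grid = [k / grid_size for k in range(1, grid_size)]
    return max(grid, key=lambda p: binomial_log_likelihood(p, x, n))

def method_of_moments(x, n):
    """MME: equate the observed count x to its expectation n * p and solve for p."""
    return x / n

if __name__ == "__main__":
    x, n = 37, 100  # hypothetical data
    print("MLE (grid search):", mle_by_grid_search(x, n))
    print("MME (closed form):", method_of_moments(x, n))
```

The two procedures rest on different principles but return the same value on the same data, up to the grid resolution.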
As another example, note that the stopping rules of ξ bin and ξ negbin are different from each other. The stopping rule affects D s because the maximum number of black ravens in ξ bin is n, but in ξ negbin , it is the maximum number of black ravens in the population. And yet, the estimate of p is the same in both experiments.
Data sharing is sometimes viewed as a prerequisite for a reproducible science [8,13,35,36]. Our analysis suggests that this statement requires further qualification and calls for attention to D s . Result 4.5 notwithstanding, changes in D s are not trivial and they impact the true reproducibility rate. For example, ξ 0 might be designed to have a larger sample size than that of ξ. In this case, the variance of the sampling distribution of the sample mean decreases in inverse proportion to the sample size, and hence, it would be different for ξ and ξ 0 . Typically, larger sample sizes are pursued to increase the statistical power of a hypothesis test in ξ 0 . While such ξ 0 will indeed increase the power of a test, it also impacts the reproducibility rate. Counterintuitively, under some scenarios, this might play out as reproducing false results with increased frequency (see [10], for such counterintuitive results).
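The dependence of the sampling variance on sample size can be illustrated for the sample proportion, whose variance is p(1 − p)/n, so that quadrupling n divides it by four. The sketch compares this formula with a simulated variance; the function names, p = 0.3 and the two sample sizes are assumptions chosen for illustration.

```python
import random
import statistics

def sample_proportion(p_true, n, rng):
    """Proportion of black ravens in one sample of size n."""
    return sum(1 for _ in range(n) if rng.random() < p_true) / n

def empirical_variance_of_estimate(p_true, n, replications, rng):
    """Variance of the sample proportion across repeated experiments."""
    estimates = [sample_proportion(p_true, n, rng) for _ in range(replications)]
    return statistics.pvariance(estimates)

def theoretical_variance(p_true, n):
    """Variance of the sample proportion under binomial sampling."""
    return p_true * (1.0 - p_true) / n

if __name__ == "__main__":
    rng = random.Random(3)
    for n in (100, 400):  # original versus a larger replication sample
        emp = empirical_variance_of_estimate(0.3, n, 2_000, rng)
        print(f"n={n}: simulated {emp:.5f}, formula {theoretical_variance(0.3, n):.5f}")
```

Because the sampling distribution differs between the two sample sizes, any reproducibility criterion that depends on the spread of the estimate will behave differently for the original and the larger replication.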

Data values, D v
Having open access to D v has no bearing on designing and performing a meaningful ξ 0 or on the reproducibility of R. Conditional on R, ξ 0 aims to reproduce R, not D v . Therefore, reporting R from ξ is sufficient for ξ 0 to assess whether R is reproduced by R 0 . However, information from ξ can be reported in a variety of ways and does not necessarily contain R. We show this with an example. We [...]
That said, openness of D v might facilitate auditing of R and vetting it for errors. There may be other benefits to open D v such as enabling further research on D v (e.g. meta-analyses). The distinction we draw matters particularly when there may be valid ethical concerns regarding data sharing [37]. Open D v is best evaluated on its own merits as has been discussed extensively [38] but cannot be meaningfully appraised as a facilitator of replication experiments or precursor of results reproducibility. While some level of open scientific practices is necessary to obtain reproducible results, open data are not a prerequisite.

Exact versus non-exact replications: a simulation study on reproducibility rate
So far we have established that to reproduce R, not all elements of ξ need to be open, and not all elements that are required to be open need to be duplicated for a meaningful ξ 0 . On the flip side, we also established that relatively simple openness considerations such as experimental procedures, hypotheses, analyses and data will not suffice to make ξ 0 meaningful. The challenge in making π-openness useful for replication experiments is to clearly identify and delineate the elements of the idealized experiment. For example, proper K is difficult to define and communicate with precision. Also, M A is at times conflated with S post and left opaque in reporting. As we discussed earlier, making K explicit and clearly specifying M A up to its unknowns is critical when designing ξ 0 .
Hitherto, we focused on replication experiments and only alluded to results reproducibility when needed. In this tack, we have mathematically isolated ξ from R and made some statements about ξ unconditional, and then conditional on R to emphasize their difference. Now that we turn our attention to explicitly drawing the link from replications to reproducibility, we condition R on ξ.
Given a sequence of exact replication experiments ξ (1) , ξ (2) , … and a result R from an original experiment ξ, do we expect to confirm R with high probability irrespective of the elements of ξ? The answer is 'no' as shown elsewhere [10,21]. The true reproducibility rate of a result is a function of not only the true model generating the data but also the elements of the idealized experiment. ξ may be characterized by a misspecified M A (e.g. omitted variables, incorrect formulation between variables and parameters), unreliable S pre (e.g. measurement error, confounded designs, non-probability samples), unreliable S post (e.g. inconsistent estimators, violated statistical assumptions), errors in D (e.g. recording errors), or large noise-to-signal ratio (e.g. large error variance and small expected value). All of these lead to the mathematical conclusion that the true reproducibility rate ϕ is specific to each configuration of ξ and thus can take any value on [0, 1]. Therefore, ϕ tells us more about the experiment itself than some unobserved reality that is presumed to exist beyond it. Since we are now conditioning on ξ and questioning the reproducibility rate of R, the conclusion is that while a degree of openness may be able to address a 'replication' crisis by facilitating faithful replication experiments, it does not suffice to solve any alleged 'reproducibility' crisis.
Openness of elements of ξ facilitates ξ 0 , thereby allowing us to estimate ϕ of R by ϕ N conditional on ξ.
However, ϕ cannot be reasonably used as a target of scientific practices where each ξ is designed to maximize it. It does not make sense to think that a ξ that returns the highest reproducibility rate for a given R is the scientifically most relevant or most rigorous experiment. For example, choosing an S post that always returns the same fixed value regardless of D v would yield ϕ = 1. In fact, ϕ can be made independent of what it would be under sampling error. 7 A reasonable expectation from ξ 0 is to deliver a scientifically relevant estimate of ϕ, given R. Openness plays an important role in this regard. In §4, we established that any non-open elements of ξ would need to be substituted for in ξ 0 , leading to a non-exact replication. The following result states how a sequence of non-exact replications alters the reproducibility rate. [...]
Result 5.1 states that the true reproducibility rate to which the estimated reproducibility rate of a sequence of non-exact replication experiments converges is the mean reproducibility rate of results from all experiments in the non-exact sequence and not the true reproducibility rate of a fixed original result. Hence, the reproducibility rate is a function of all elements of the idealized experiment, for both a fixed original experiment and all its replications. Each replication that is non-exact in a different way from others introduces variability, decreasing the precision of estimates given a fixed number of replications.
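The convergence described by result 5.1 can be sketched by treating each replication as a success or failure whose probability is the true reproducibility rate of whichever experiment happened to be run. The function name, the three rates 0.2, 0.5 and 0.9 and the uniform choice among them are assumptions for illustration only.

```python
import random

def estimated_rate_non_exact(true_rates, n_replications, rng):
    """Estimated reproducibility rate when each replication is drawn uniformly
    from a set of experiments, each with its own true reproducibility rate."""
    successes = 0
    for _ in range(n_replications):
        phi = rng.choice(true_rates)   # which non-exact replication is run
        if rng.random() < phi:         # is the result reproduced this time?
            successes += 1
    return successes / n_replications

if __name__ == "__main__":
    rates = [0.2, 0.5, 0.9]  # hypothetical true rates of three experiments
    estimate = estimated_rate_non_exact(rates, 50_000, random.Random(5))
    print("estimated rate:", round(estimate, 3))
    print("mean of true rates:", round(sum(rates) / len(rates), 3))
```

The estimate settles near the mean of the three rates rather than near any single one of them, including the rate of whichever experiment is regarded as the original.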
We illustrate the link between replication experiments and reproducibility rate with a simulation study. We consider a series of exact and non-exact replication experiments to analyse the variation in the reproducibility rate of a result as a function of the elements of ξ. We use sequences of two idealized experiments ξ poi and ξ nor , which are approximations to the binomial model from our toy example. For all conditions, we fix the true proportion of black ravens and the number of trials in the exact binomial model at 0.01 and 1000, respectively. These arbitrary choices make the true reproducibility rate distinct under ξ poi and ξ nor . As R, we choose a point estimate for the location parameter of the probability model. For convenience, we assume that the parameter estimates of the original experiments are equal to the true value. After each replication experiment, we determine whether this result is reproduced by [...]
Figure 3a,b shows 100 independent runs of a sequence of 1000 exact replication experiments under these conditions, for ξ poi and ξ nor , respectively.
In non-exact replications, we vary the set from which the replication experiment is chosen uniformly at random in each step. This results in three additional conditions: a set of all eight idealized experiments, a set of four idealized experiments with lowest reproducibility rates and a set of four idealized experiments with highest reproducibility rates. Figure 3c shows 100 independent runs of a sequence of 1000 non-exact replication experiments under these conditions.
We emphasize that all parameters of the simulation example in figure 3 are chosen so that the implications of differences between different models, methods and data structures make the link between replications and reproducibility explicit. It is certainly possible to choose these parameters to obtain any true reproducibility rate defined by a specific ξ since ϕ ∈ [0, 1].
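A rough sketch of how an estimated reproducibility rate can be computed for a sequence of exact replications is given below. Only the proportion 0.01 and the 1000 trials come from the description above. Everything else is an assumption made here for illustration and should not be read as the simulation design behind figure 3: the reproducibility criterion (the replication's sample mean falls within a fixed tolerance of the original estimate), the per-replication sample size, the tolerance and all function names.

```python
import math
import random

N_TRIALS, P_BLACK = 1000, 0.01   # values stated in the text
LOCATION = N_TRIALS * P_BLACK    # expected number of black ravens (10)

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for a small mean such as 10."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def draw_normal(lam, rng):
    """Normal approximation to the binomial count: mean n*p, variance n*p*(1-p)."""
    return rng.gauss(lam, math.sqrt(lam * (1.0 - P_BLACK)))

def reproduced(sampler, sample_size, tolerance, rng):
    """ASSUMED criterion: sample mean within `tolerance` of the original estimate."""
    mean = sum(sampler(LOCATION, rng) for _ in range(sample_size)) / sample_size
    return abs(mean - LOCATION) <= tolerance

def estimated_rate(sampler, sample_size, tolerance, n_replications, rng):
    """Fraction of exact replications in which the original result is reproduced."""
    hits = sum(reproduced(sampler, sample_size, tolerance, rng)
               for _ in range(n_replications))
    return hits / n_replications

if __name__ == "__main__":
    for name, sampler in (("poisson", draw_poisson), ("normal", draw_normal)):
        rate = estimated_rate(sampler, 10, 1.0, 1_000, random.Random(2))
        print(f"{name}: estimated reproducibility rate {rate:.3f}")
```

Changing the assumed sample size, tolerance or estimator changes the rate that the sequence converges to, which is the dependence on the elements of the experiment that the simulation study is meant to expose.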
Conditional on R, some conclusions from figure 3 are as follows.
1. The true reproducibility rate depends on the true data-generating mechanism and the elements of the original experiment. Specifically, the true reproducibility rate in our simulation is a function of the true model generating the data, M A , and also D s such as the sample size, and S post such as the method of point estimation. This can be seen from exact replication sequences of eight idealized experiments in figure 3a,b, with the true reproducibility rate for each experiment indicated by stars.
2. By the weak law of large numbers, even if the true reproducibility rate is high (e.g. orange in figure 3a and green in figure 3b), the estimated reproducibility rate from a short sequence of exact replications has higher variance than the estimated reproducibility rate in a longer sequence. However, the estimated reproducibility rate from exact replications ultimately converges to the true reproducibility rate of an original result from a fixed ξ illustrating result 2.7.
[...] In practice, however, we do not have access to the true reproducibility rate of any idealized experiment to help determine our replication sets. We have to make our decision based on the elements of the idealized experiment instead, and that requires a thorough understanding of how each element of the idealized experiment impacts the reproducibility rate in a given situation.
5. The variance of the estimated reproducibility rate of results in a sequence of non-exact replications can be higher or lower than the variance of the estimated reproducibility rate in a sequence of exact replications of the original experiment. The pattern of variances we observe in figure 3d is a direct consequence of nϕ following a binomial distribution and result 5.1. As a mathematical fact of the binomial distribution, its variance is maximum at ϕ = 0.5 and decreases as the probability of success, ϕ, gets closer to 0 or 1. Hence, we expect our estimates to vary greatly in a sequence of non-exact replication experiments with moderate true reproducibility rates. If a sequence of non-exact replications comes from a homogeneous set of very high (or very low) true reproducibility rates, we expect our estimates to vary little. On the other hand, we expect highest variation in our estimates from exact replications if ϕ = 0.5 from the original experiment and from non-exact replications if they are highly heterogeneous in their true reproducibility rates.
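The variance argument in the last point can be written out in one line. The notation here is an assumption for illustration: N denotes the number of replications in the sequence and \hat{\phi}_N the estimated reproducibility rate, with the number of reproduced results treated as binomial.

```latex
\[
  N\hat{\phi}_N \sim \mathrm{Binomial}(N, \phi)
  \quad\Longrightarrow\quad
  \operatorname{Var}\bigl(\hat{\phi}_N\bigr) = \frac{\phi(1-\phi)}{N},
  \qquad
  \frac{d}{d\phi}\,\phi(1-\phi) = 1 - 2\phi = 0 \iff \phi = \tfrac{1}{2}.
\]
```

Since the second derivative of \phi(1-\phi) is negative, the stationary point at 1/2 is a maximum, and the variance falls towards zero as \phi approaches 0 or 1.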
In sum, the mere choice of the elements of ξ impacts both the level of the true reproducibility rate and the variance of the estimated reproducibility rate. Any divergence in ξ 0 may move the estimated reproducibility rate away from the true value for an original result and increase the variance of its estimates. In appendix H, we provide a broader example for result 5.1 in the context of linear regression models, under a model selection (rather than parameter estimation) scenario, where both true and false original results are considered. This simulation study demonstrates a similar pattern of results to those presented in figure 3. Combined, simulation results confirm that reproducibility rate can take any value on [0, 1] depending on the elements of ξ even when the original experiment indeed captures a true result, there is no scientific malpractice, and meaningful replication experiments can be performed to reproduce R.

Discussion
In this article, we focused on the scientific experiment as the critical unit of analysis, formalizing the logical structure of experiments towards building a theory of reproducibility. We clarified what makes for a meaningful replication experiment even when an exact replication experiment is not possible and established how openness of different elements of the idealized experiment contributes to it. We distinguished between the ability of a replication experiment to reproduce a result and the true reproducibility rate for that result. We showed that theoretically it is not possible to justify a desired level of reproducibility rate in a given line of research or to reach a high level of reproducibility rate via eliminating malpractice, requiring open procedures or data, or performing replication experiments. We understand the potential lack of enthusiasm of practitioners who find that the theory we develop does not have immediate application to their scientific practice. Our work is theoretical and is meant to present a framework to understand and study the objects and products of science. It is not meant to provide solutions to immediate problems scientists face. Practitioners often turn to theory for a clear answer to their difficulties in real-life studies. Our goal is not to provide these answers. We lay the groundwork that would potentially be needed to address such problems in the indefinite future, but the theoretical work is slow and incremental. In our simulations, we can create perfect conditions to illustrate our theoretical results because we set our own model parameters, and we know what our models are and what they mean. Perfect transparency exists and there are no misunderstandings because every aspect of scientific objects is precisely known. All mathematical and statistical problems (excepting paper-and-pencil exact solutions) are primarily studied this way.
Science as practised, on the other hand, is messy, ambiguous, loose, hard to define and communicate. There is no easy or direct translation of our work to the myriad imperfections of scientific practice. Our idealizations are removed from reality to make theoretical work possible in the first place. We are only laying the building blocks of such a theory to make practical implementations possible in the future. All this does not mean theory is currently of no practical relevance, however. In fact, we think that without a thorough understanding of mathematical implications of reproducibility and replications, we cannot be ready to interpret the results and solve problems that arise in practice.
With this constraint in mind, we discuss key theoretical insights from our findings.

Reproducibility and the search for truth
A layperson understanding of reproducibility to the effect that 'if we observe a natural phenomenon, we should be able to reproduce it and if we cannot reproduce it, our initial observation must have been a fluke' is exceedingly misleading. A statistical fact is that reproducibility is not simply a function of 'truth'. This was illustrated in [21] and proved in [10]: true results are not perfectly reproducible, and perfectly reproducible results are not always true (see appendix I for proof). The true reproducibility rate of a result and the variability in its estimator are determined by many factors, including but not limited to the true data-generating mechanism: the degree of rigour of the original experiment, as assessed by the extent to which its elements are individually reliable and internally compatible with each other; the degree to which replication experiments are faithful to the original and how any discrepancies impact the results; the degree of rigour of the replication experiment wherever it diverges from the original; and how we determine whether a result has been reproduced. Factors such as effect size, sampling error, missing background knowledge and model misspecification [39,40] could render true results difficult to reproduce. As a useful reminder, sampling error might be masked by the choice of method and other elements of the idealized experiment. A false result could be 100% reproducible due to the choice of estimation method. Therefore, judgements of reproducibility cannot exclusively be used to make valid inference on the truth value of a given result (see also [41], for a computational model with a similar conclusion).
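The claim that a false result can be perfectly reproducible while a true one is not can be sketched with two estimators applied to the same simulated data. The setup is hypothetical: the true proportion 0.3, the false value 0.9, the sample size, the tolerance-based reproducibility criterion and all names are assumptions for illustration.

```python
import random

TRUE_P = 0.3  # arbitrary illustrative value of the true proportion

def honest_estimator(sample):
    """Sample proportion of black ravens."""
    return sum(sample) / len(sample)

def constant_estimator(sample):
    """Ignores the data entirely and always reports 0.9 (a false value here)."""
    return 0.9

def reproducibility_rate(estimator, original_result, tolerance, n, n_replications, rng):
    """Fraction of replications whose estimate is within `tolerance` of the original."""
    hits = 0
    for _ in range(n_replications):
        sample = [rng.random() < TRUE_P for _ in range(n)]
        if abs(estimator(sample) - original_result) <= tolerance:
            hits += 1
    return hits / n_replications

if __name__ == "__main__":
    rng = random.Random(8)
    true_result = reproducibility_rate(honest_estimator, 0.3, 0.025, 100, 2_000, rng)
    false_result = reproducibility_rate(constant_estimator, 0.9, 0.025, 100, 2_000, rng)
    print(f"true result, honest estimator:    {true_result:.3f}")
    print(f"false result, constant estimator: {false_result:.3f}")
```

The constant estimator hides sampling error completely, so its false result is reproduced every time, whereas the true result obtained with the honest estimator is reproduced only part of the time.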
Even if some form of a perfect experiment that captures ground truth and its exact replications exist, it might take many epistemic iterations of theoretical, methodological and empirical research to achieve them (see [42], p. 45, for a detailed discussion on epistemic iteration). We cannot expect to skip the arduous iterative process of doing science and hope to arrive at a non-trivially reproducible science with procedural interventions. In most fields and stages of science, focusing on maximizing reproducibility seems like a fool's errand. For meaningful scientific progress, at the minimum, we should take care to properly analyse the elements of the original experiment to assess how they might impact the true reproducibility rate and analyse the discrepancies of replication experiment(s) from the original to gauge how our reproducibility estimates may vary from the true value of the original result's reproducibility. In the course of 'normal science' (borrowing terminology from [43]), reproducibility of a result is more likely to tell us something about the experiments that generated the result and its reproducibility rate estimates than the lawlikeness of some underlying phenomenon.

Defining reproducibility
One aspect of reproducibility that often gets overlooked is that how we define and quantify a result and its reproducibility also determines the true reproducibility rate (see also [44] for a discussion of different statistical methods to assess reproducibility and their limitations). For example, in a null hypothesis significance test, we might call a 'reject' decision in a replication experiment a successfully reproduced result if the original experiment rejected the hypothesis. On the other hand, we might instead look at whether the effect size estimate of the replication experiment falls within some fixed error around the point estimate from the original experiment. Everything else being equal, the true reproducibility rates are expected to differ between these two cases because they use different reproducibility criteria.
Our findings hold under mathematical definitions of a result (definition 2.3) and of reproducibility rate (definition 2.5). In the absence of such theoretical precision, we often resort to heuristic, common sense interpretations of terms. In appendix A, we present a detailed argument on why and how theoretical precision matters and provide an example of a plausible measure of reproducibility without desirable statistical properties. Such lax standards in definitions invite unwanted or strategic abuse of ambiguities when interpreting replication results when we have a limited understanding of what we should expect to observe. Our surprise at 'failed' replication results or delight in 'successful' ones may not be warranted, and what we observe could simply be a theoretical limitation imposed by our definitions rather than a reflection of the true signal that presumably exists in nature. For an extreme example, consider the following: we might call a result reproduced if the replication effect size estimate falls on the real line. That would trivially give us a 100% reproducibility rate.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 221042 Whenever we evaluate replications and estimate reproducibility, it is incumbent on us to understand how we define our results, how we determine reproducibility and how our measures should be expected to behave under specific conditions.
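The dependence of the reproducibility rate on the chosen criterion can be illustrated with a small simulation. The following is a minimal sketch rather than the procedure used in this article: the normal data-generating mechanism, the two-sided z-test, the fixed original estimate of 0.52 and the tolerance `delta` are all assumptions introduced for illustration. It applies three criteria to the same sequence of exact replications: 'the replication also rejects', 'the replication estimate falls within a fixed error of the original estimate' and the trivial 'the estimate falls on the real line'.

```python
import random
from statistics import NormalDist


def replication_rates(mu=0.5, sigma=1.0, n=100, original_estimate=0.52,
                      delta=0.05, alpha=0.05, n_replications=2000, seed=1):
    """Estimate the reproducibility rate of one original result under three criteria.

    All parameter values are illustrative assumptions, not values from the article.
    """
    rng = random.Random(seed)
    se = sigma / n ** 0.5                          # standard error of the sample mean
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    hits = {"reject_again": 0, "within_delta": 0, "on_real_line": 0}
    for _ in range(n_replications):
        # The mean of n i.i.d. normal draws is itself normal, so draw it directly.
        estimate = rng.gauss(mu, se)
        hits["reject_again"] += abs(estimate / se) > z_crit
        hits["within_delta"] += abs(estimate - original_estimate) <= delta
        hits["on_real_line"] += True               # any real-valued estimate counts
    return {name: count / n_replications for name, count in hits.items()}


rates = replication_rates()
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Under these assumptions the same replications yield very different rates depending only on the criterion, and the third criterion gives a rate of 1 by construction.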

Reproducibility and openness
Open practices in science have been intuitively proposed as a key to solving the issues surrounding reproducibility of scientific results. However, a formal framework to validate this intuition has been missing and is needed for a clear discussion of reproducibility. The notion of idealized experiment serves as a theoretical foundation for this purpose. By using this foundation, we have distinguished the concepts of replication and reproducibility, showing how openness is related to meaningful replications. We have also distinguished between two types of reproducibility (appendix C). Whether elements from one experiment carry over to a replication experiment is only relevant to epistemic (as opposed to in-principle) reproducibility. In practice, however, resource constraints determine the availability and transferability of information between experiments. A realistic framework needs to provide a refined sense of which elements of an experiment need to be open to reproduce a given result, as opposed to simply saying 'all of it'.
We have identified different levels and layers of openness and examined their implications. An experiment that is completely open in all elements does not necessarily lead to reproducible results and an experiment that does not open its data does not necessarily hinder replication experiments. Nevertheless, irreproducible results sometimes raise suspicion and discussions turn towards concerns regarding the transparency of research or validity of findings. These discussions are typically driven by heuristic thinking about replications. Such heuristics might not hold and can lead to erroneous inferences about research findings and researchers' practices. To move the needle forward, we have provided a detailed evaluation of which elements of an experiment need to be made open relative to some objective, and which do not. For example, while necessary to audit the results of a given experiment, data sharing is not a prerequisite for performing replications or reproducing results (contrary to some suggestions, for example by [13]), but other elements of an experiment are. On the other hand, reporting model details, such as modelling assumptions, model structure and parameters, becomes critical for improving the accuracy of estimates of reproducibility. Notably, even in recent recommendations for improving transparency in reporting via practices such as preregistration, models are typically left out while transparency of hypotheses, methods and study design is emphasized [45,46]. Also noteworthy is that some degrees of openness are difficult to attain, such as fully open background knowledge, often causing practical constraints to limit our choices for replication experiments.
When critical elements of an original experiment are not open, replication researchers would be forced to introduce substitutions in their experimental designs. Such substitutions, as we have illustrated, characterize non-exact replications and will probably alter reproducibility rates in different directions, contributing to the challenge of interpreting replication results. Strong theoretical foundations and well-defined shared empirical paradigms in a given area of research could help generate meaningful substitutions whose downstream consequences on inference are well understood.

Choosing non-exact replications
Assuming a sequence of perfectly repeatable experiments is a theoretical convenience, one that frequentist statistics especially enjoys. In scientific practice, we lack the luxury provided by this assumption. Exact replications are practically impossible. Understanding the implications of result 5.1 is crucial in this respect. It states that any sequence of non-exact replications converges to a true reproducibility rate. This rate may or may not be scientifically meaningful for a specific purpose. Especially for a sequence of non-exact replications, it is hard to find a scientifically meaningful interpretation of what the reproducibility rate shows, even when it is high.
A proper understanding of the elements of the original experiment needs to precede any replication design. And wherever divergences from the original experiment are inevitable, we should strive to theoretically match new design elements to the original ones if our objective is to reproduce an original result. When that is not possible, simulations varying the degree and nature of these divergences would inform us on their impact on the reproducibility rate and can provide guidance in designing non-exact replication experiments. A lack of theoretical understanding in this regard poses significant constraints on the interpretability of replication results.
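A simulation of the kind suggested above can be sketched in a few lines. The sketch assumes, purely for illustration, a normal data-generating mechanism with mean `mu`, an original result defined as a rejection in a one-sided z-test, and non-exact replications that diverge from the original only in sample size; none of these choices comes from the article. It compares the observed reproducibility rate of a mixed sequence of replications with the mean of the true rates of the individual designs, and with the rate under exact replications.

```python
import random
from statistics import NormalDist


def true_rate(n, mu=0.3, sigma=1.0, alpha=0.05):
    """Probability that a one-sided z-test with sample size n rejects H0: mu <= 0."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - mu * n ** 0.5 / sigma)


def observed_rate(sample_sizes, n_replications=6000, mu=0.3, sigma=1.0,
                  alpha=0.05, seed=2):
    """Share of replications that reject, cycling through the given sample sizes."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    reproduced = 0
    for i in range(n_replications):
        n = sample_sizes[i % len(sample_sizes)]   # design of the i-th replication
        mean = rng.gauss(mu, sigma / n ** 0.5)    # sample mean of n normal draws
        reproduced += mean * n ** 0.5 / sigma > z_crit
    return reproduced / n_replications


sizes = [20, 50, 100]                             # hypothetical divergences in design
mean_true = sum(true_rate(n) for n in sizes) / len(sizes)
mixed = observed_rate(sizes)                      # non-exact replications
exact = observed_rate([100])                      # exact replications at n = 100

print(f"mean of true rates (non-exact): {mean_true:.3f}, observed: {mixed:.3f}")
print(f"true rate (exact, n = 100):     {true_rate(100):.3f}, observed: {exact:.3f}")
```

In this sketch the mixed sequence settles near the average of the design-specific rates, which differs from the rate of the original design; whether that average is scientifically meaningful is a separate question.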
In cases where the original experiment suffers from design issues that make results predictably less reproducible, it is advisable to iteratively work toward improving the configuration of the idealized experiment first before attempting any non-exact replications [20]. If there is nothing there to revisit, we might be better off saving our scientific curiosity and resources for more fruitful avenues. In fact, there is room for major theoretical advancements on why and how to choose replications.

Reproducibility of a result versus accumulation of scientific evidence
We hope that advancing theoretical understanding of results reproducibility helps delineate how and why it is different from other quantities that aim to measure the accumulation of scientific evidence. The notion of reproducibility is unique in the sense that it is anchored on the results of an initial experiment. In contrast, meta-analytic effect size estimates focus on an underlying true effect, after accounting for variation between the studies being meta-analysed, while robustness tests aim to assess to what extent estimated quantities of interest are sensitive to changes in model specifications. It is a widespread interpretation that reproducibility also speaks to the reliability or validity of an underlying true effect and can reasonably be used as a measure of evidence accumulation. It should be clear by now that this is a misconception. Truth certainly plays a role in reproducibility of a given result but not (always) too loudly, as reproducibility primarily captures patterns specific to the original experiment. A replication experiment in reference to an original result is a particular kind of idealized experiment that has the capacity for achieving certain scientific objectives, such as confirming a theoretically precise prediction under well-specified conditions (i.e. attempting to account for sampling error as a last source of uncertainty after everything else has already been accounted for) or estimating the reproducibility rate of a particular result of a given experiment.
For other scientific objectives, such as to make an initial scientific discovery, to pinpoint the conditions under which a precise and reliable signal can be captured, to aggregate evidence for a theorized phenomenon or to gauge the robustness or heterogeneity of an observed phenomenon across contexts, there are other idealized experiments better suited to the task than replications [20,41] such as systematic exploratory experimentation [47], metastudies [48], multiverse analyses [49], meta-analyses and continuously cumulating meta-analyses [50] (see footnote 8). The fact that scientists still care to meticulously design their experiments to be informative and meaningful has more to do with other scientific values and objectives than reproducibility.
In a sense, accumulation of scientific evidence in support of a finding requires epistemic iterations and confirmation by independent approaches and methods to achieve specific scientific objectives (e.g. discovering a new phenomenon, explaining a mechanism, predicting a future observation). This process leads to gradually eliminating uncertainty and enhancing our confidence in our theories and observations. On the other hand, attempts at reproducing a given result in replications prioritize understanding and fine-tuning the logical structure of experiments, which we see as human datagenerating mechanisms. Proper appreciation of this aspect of reproducibility is capable of guiding us in the right direction in our struggle to design more rigorous and informative experiments under uncertainty.

Concluding remarks
The discourse on scientific reform and metascience has so far pursued a 'crisis' framing, focusing on behavioural, social, institutional and ethical failings of the scientific endeavour and calling for immediate institutional and collective action. Our analysis shows that neither elimination of scientific malpractice nor actively encouraging replication experiments would necessarily improve the reproducibility of results, because irreproducibility, when formally defined, appears to be an inherent property of the scientific process rather than a meaningful scientific objective to pursue. While reproducibility rate is a parameter of the system and thereby a function of truth, that view of the concept misses the big picture: reproducibility reflects the properties of experiments. We perceive two issues with advancing a replication/reproducibility crisis narrative:
1. Conflating replication and reproducibility creates an inaccurate impression that these two alleged issues of not being able to conduct informative replication experiments and not being able to reproduce results are indistinguishable issues that can be addressed via similar solutions.
2. Framing irreproducibility as a crisis implies that there is an ideal rate of reproducibility we should expect or strive to achieve in a given field at a given time, and we are falling short of this ideal standard.
Our mathematical results firmly argue against both of these misconceptions. Shifting the discourse on scientific reform and metascience towards greater theoretical understanding may help change the course of science. Instead of prioritizing crisis management measures, progress can be made by falling back on fundamental issues and working our way from the bottom up. That may require individual scientists to take a step back and reassess the way they have been practising science. Circling back to our original premise, we emphasize that the problem is conceptual: the logical structure of experiments is not well understood and how experiments relate to reality gets misconstrued. Experiments are data-generating machines, and each element outlined in this work determines what kind of data they will generate. Gaining clarity with regard to how experiments impact the observed reality and properly assessing the empirical value of a given experiment for a given objective should precede concerns regarding possible replications. Theory of reproducibility is a step in this direction.
Footnote 8. We have deliberately excluded multi-site replications from this list as there are reasons to suspect that, as they are practised, multi-site replications are not necessarily appropriate for the purposes of a robustness check, for reasons detailed in [51]. This is largely on account of each replication being a non-exact replication in a unique and uncontrolled way.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration. We declare we have no competing interests.
Funding. This study was supported by the National Institute of General Medical Sciences of the National Institutes of Health (Award no. P20GM104420).
Acknowledgements. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix A
Proof of result 2.7 and an example pertaining to remark 4.1 that a meaningful continuous measure of reproducibility is nonetheless pathological.
Proof. Result 2.7 is a consequence of the Strong Law of Large Numbers. An easy proof relies on Kolmogorov's almost everywhere convergence theorem, which states that the sample mean of a sequence of independent and identically distributed random variables with finite mean converges almost surely to a constant if and only if that constant is the expected value of the random variables. The sequence $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(N)}$ obtained from $\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(N)}$ (respectively) satisfies the conditions of Kolmogorov's theorem. By definition 2.5, $\phi_N \in [0, 1]$, the $\phi^{(i)}$ are independent of each other and identically distributed, and the expected value is $E(\phi_N) = \phi < \infty$, proving result 2.7. Importantly, remark 4.1 cautions us that result 2.1 does not hold for all measures of reproducibility. A well-defined $\xi$ and $\phi$ are prerequisites for result 2.7 to hold. We use a counterexample with a continuous measure of reproducibility to clarify this point. As opposed to a 0-1 measure such as $\phi_N$, we consider a (maybe) more desirable measure of reproducibility rate, perhaps a degree of agreement between the results of $\xi$ and $\xi'$ to assess whether $r_o$ from $\xi$ is reproduced in $\xi'$. One way to represent this degree of agreement is to replace the indicator function in definition 2.5 with a function of a continuous random variable. For example, for a sequence of idealized experiments $\xi^{(1)}, \xi^{(2)}, \ldots$ we might define $Y^{(i+1)}/Y^{(i)}$, where $Y^{(i)} \sim \mathrm{Nor}(0, \sigma)$ is a centralized statistic from $\xi^{(i)}$, as a score of how extreme a specific result is with respect to an original result $Y^{(o)}$. Here, the $Y^{(i)}$ are independent and identically distributed random variables conditional on $\xi^{(i)}$. The set-up is such that if $Y^{(i+1)}/Y^{(i)} = 1$, then the results in $\xi^{(i+1)}$ and $\xi^{(i)}$ have exactly the same degree of agreement. Thus, one can define the reproducibility rate as follows: […] concept of result reproducibility falls apart.
This example shows that one has to define the parameter and its estimator of the reproducibility rate by obeying the constraints of statistically desired properties for reproducibility rate to be a useful concept. It is wise to check that a new concept defined in a developing field is statistically well behaved. Statistical nuances might get lost in applications with important consequences for results reproducibility.
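The pathology of a ratio-based score can be seen numerically. The sketch below rests on a well-known fact, namely that the ratio of two independent centred normal variables follows a Cauchy distribution, which has no finite mean, so averages of such scores do not stabilize. As a simplifying assumption, each score is simulated as the ratio of two independent draws rather than of consecutive statistics $Y^{(i+1)}/Y^{(i)}$, and the function names are hypothetical. The sketch compares the spread (interquartile range across repeated runs) of the average ratio score with that of the average of a 0-1 indicator score.

```python
import random
from statistics import quantiles


def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def ratio_score(rng, sigma=1.0):
    # Ratio of two independent centred normal statistics (Cauchy distributed).
    return rng.gauss(0.0, sigma) / rng.gauss(0.0, sigma)


def indicator_score(rng, phi=0.5):
    # 0-1 measure: 1 if the result is reproduced, which happens with probability phi.
    return 1.0 if rng.random() < phi else 0.0


def spread_of_means(score, n_scores, n_runs=200, seed=3):
    """IQR, across independent runs, of the average of n_scores scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_runs):
        total = sum(score(rng) for _ in range(n_scores))
        means.append(total / n_scores)
    return iqr(means)


spreads = {}
for n_scores in (100, 2000):
    spreads[("ratio", n_scores)] = spread_of_means(ratio_score, n_scores)
    spreads[("indicator", n_scores)] = spread_of_means(indicator_score, n_scores)

for (name, n_scores), value in spreads.items():
    print(f"{name:9s} N = {n_scores:4d}  IQR of the average = {value:.3f}")
```

Under these assumptions the spread of the indicator-based average shrinks as the number of replications grows, whereas the spread of the ratio-based average does not.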
Some additional statistical properties of $\phi_N$ given in definition 2.5 are as follows. The sampling distribution of $\phi_N$ is asymptotically normal with $E(\phi_N) = \phi$ and $\mathrm{Var}(\phi_N) = \phi(1 - \phi)/N$ by the CLT (equivalently, the number of reproduced results $N\phi_N$ has mean $N\phi$ and variance $N\phi(1 - \phi)$). All else being equal, the results for which the true reproducibility rate is high or low have low variance for the estimator, and for the results for which the true reproducibility rate is around 0.5, the variance of the point estimator is large (largest when $\phi = 0.5$). Approximately 100% confidence intervals (and tests of approximately power 1) can arbitrarily be built, with the property that only finitely many of the confidence intervals do not contain the true reproducibility rate $\phi$. This result, which fundamentally relies on the law of the iterated logarithm, constitutes a strong basis for statistical methods about $\phi$. ▪
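The mean and variance of the proportion of reproduced results can be checked by simulation. This is a minimal sketch assuming that each exact replication reproduces the original result independently with probability $\phi$; the values of $\phi$, the number of replications per sequence (50) and the number of simulated sequences are arbitrary illustrative choices.

```python
import random
from statistics import fmean, pvariance


def sample_rates(phi, n_replications, n_sequences=4000, seed=4):
    """Simulate many sequences of replications and return each sequence's sample rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sequences):
        reproduced = sum(rng.random() < phi for _ in range(n_replications))
        rates.append(reproduced / n_replications)
    return rates


summary = {}
for phi in (0.1, 0.5, 0.9):
    rates = sample_rates(phi, n_replications=50)
    summary[phi] = (fmean(rates), pvariance(rates))

for phi, (mean, var) in summary.items():
    expected_var = phi * (1 - phi) / 50
    print(f"phi = {phi}: mean = {mean:.3f}, variance = {var:.5f} "
          f"(phi(1 - phi)/N = {expected_var:.5f})")
```

The simulated variance is largest at $\phi = 0.5$ and smaller for rates near 0 or 1, in line with the statement above.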

Appendix B
Proof of result 2.9 (constructive): The sequence of idealized experiments $\xi^{(1)}, \xi^{(2)}, \ldots$ given by definition 2.5 is a proper stochastic process, seen as a joint function of the random sample $D$ and of each value in the support of the data-generating mechanism, $x \in \mathbb{R}$. $K$, $S$ and $M_A$ are not stochastic, so we condition on them. $\xi^{(i)}$ draws a simple random sample $D^{(i)} = X_n^{(i)}$ independent of all else. We note two facts for the proof: […]
Result 2.9 is a convenient way to study replications and reproducibility. It has a number of mathematical implications. Firstly, it establishes that $\xi$ is a well-behaved stochastic process with a limiting distribution. It is of interest to know the limit of this process, because it tells us the point to which the sample reproducibility rate from replication experiments converges.
[Figure caption fragment: … and that of an independent sample of size 10 (red), emphasizing that the ECDF is a random variable whose probability distribution is determined by the sample values (and hence the data-generating mechanism). (c) One hundred independent samples of varying sample size (grey), emphasizing that the ECDF is a stochastic process. The red vertical line shows the distribution of the ECDF conditional on the value $x^*$.]
Technically, the sequence of probability measures defined for the stochastic process associated with $\xi^{(1)}, \xi^{(2)}, \ldots$ on Borel sets, with respect to the metric that we describe below, has a limiting process to which it converges in distribution. Establishing this convergence helps us to understand and characterize the limiting behaviour of $\xi^{(1)}, \xi^{(2)}, \ldots$. Donsker's Theorem characterizes the limiting process and states that $\xi$ must converge to the Wiener measure. Thus, the probability distribution of the reproducibility rate converges to the normal distribution. Readers interested in the theory of convergence in stochastic processes may refer to [52], chapters 1-3 for details. We give a brief description of necessary background here.
There are three essential elements to study the convergence of a proper stochastic process: (i) a proper field on which the process takes values (the class of sets of interest) and a metric associated with it to assess the convergence of the process, (ii) the probability measure that determines the behaviour of the process, and (iii) using (i) and (ii), a complete mathematical formulation of the stochastic process, which can be used to show convergence to some well-defined distribution. We now consider a stochastic process as a function of $t \in [0, 1]$, a random point in the space of right continuous functions on $[0, 1]$ with left-hand limits. We let the metric to assess the convergence be the supremum of the $L_1$ norm between any two points in the space, that is, the classical Kolmogorov-Smirnov distance. By $\lfloor nt \rfloor$, we denote the floor function, the integer part of $nt$. Given $\{X_n = (X_1, X_2, \ldots, X_n);\ n \in \mathbb{Z}^+\}$, where the $X_i$ are independent of each other and identically distributed, we define the stochastic process on partial sums,
$$\frac{\sum_{i=1}^{\lfloor nt \rfloor} [X_i - E(X_i)] + [nt - \lfloor nt \rfloor]\,[X_{\lfloor nt \rfloor + 1} - E(X_i)]}{\sqrt{n\,\mathrm{Var}(X_i)}}.$$
For elements of this process, if we denote the probability distribution for a sample size $n$ by $P_n$, then the limiting distribution is the well-known Wiener measure, $W$. Some results follow from this. $\xi$ is most generic when $M_A$ is any probability model. This induces $S_{post}$ having the sampling distribution function of any statistic. In this most generic case, the distribution of the sample reproducibility rate $\phi_N$ for the sequence $\xi^{(1)}, \xi^{(2)}, \ldots$ is asymptotically normal. To see this, we first let $X_n = M_A^{-1}(w)$, where $w \in [0, 1]$, so that we have the image of the statistical model, and assume that $\phi_N$ evaluated at 0 and 1 is 0. The stochastic process $\sqrt{n}\{\xi[M_A^{-1}(w)] - w\}$ converges to a specific Wiener process with bound endpoints, which is a Brownian bridge: the process is Gaussian with zero expectation and, for two points $w_1$ and $w_2$, the covariance function is $\mathrm{Cov}(W(w_1), W(w_2)) = w_1(1 - w_2)$, with the ordering $w_1 \le w_2$ and $w_i \in [0, 1]$.
By definition of this stochastic process and its convergence to a Brownian bridge, we see that for each fixed value of $x$, $\xi$ is asymptotically normally distributed with mean $M_A$ and variance $M_A(1 - M_A)/n$.
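The Brownian bridge covariance can be checked numerically for the empirical distribution function. The following sketch assumes, for illustration only, that the model is the uniform distribution on $[0, 1]$, so that $M_A^{-1}(w) = w$; the sample size, the number of repetitions and the two evaluation points are arbitrary choices, and the helper names are hypothetical.

```python
import random
from statistics import fmean


def ecdf_at(sample, x):
    """Empirical distribution function of the sample evaluated at x."""
    return sum(value <= x for value in sample) / len(sample)


def scaled_deviations(w1, w2, n=200, n_repeats=3000, seed=5):
    """Pairs sqrt(n) * (ECDF(w) - w) at w1 and w2 for repeated uniform samples."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_repeats):
        sample = [rng.random() for _ in range(n)]
        pairs.append((n ** 0.5 * (ecdf_at(sample, w1) - w1),
                      n ** 0.5 * (ecdf_at(sample, w2) - w2)))
    return pairs


def covariance(pairs):
    """Population covariance of a list of (a, b) pairs."""
    mean_a = fmean(a for a, _ in pairs)
    mean_b = fmean(b for _, b in pairs)
    return fmean((a - mean_a) * (b - mean_b) for a, b in pairs)


w1, w2 = 0.3, 0.7
pairs = scaled_deviations(w1, w2)
cov = covariance(pairs)
var_w1 = covariance([(a, a) for a, _ in pairs])

print(f"covariance:     {cov:.3f}  (w1(1 - w2) = {w1 * (1 - w2):.3f})")
print(f"variance at w1: {var_w1:.3f}  (w1(1 - w1) = {w1 * (1 - w1):.3f})")
```

With $w_1 = w_2$ the covariance reduces to $w(1 - w)$, which corresponds to the variance $M_A(1 - M_A)/n$ of the unscaled process at a fixed point.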
The result can also be studied fixing one dimension at a time, giving two corollaries. For random data X n , the elements of the sequence of replication experiments ξ (1) , ξ (2) , … are random variables and conditionally independent of each other. For fixed data, the elements of the sequence of replication experiments ξ (1) , ξ (2) , … are probability models.

Appendix C
Details on remark 4.1: Let $\xi$ be an idealized experiment and $\xi'$ be its exact replication. Conditional on $R$ from $\xi$, $K'$ is necessarily distinct from $K$ for epistemic reproducibility of $R$ by $R'$, but not necessarily distinct for in-principle reproducibility of $R$ by $R'$.
We define and distinguish in-principle reproducibility and epistemic reproducibility conditional on a result $R$. It is clear that $\pi$-openness, where $\pi$ is a non-empty set, is necessary to make the elements of $\xi$ available for the replication $\xi'$. Further, $R$ also needs to be open for $\xi'$ to be able to determine whether $R'$ has epistemically reproduced $R$. So, information on $R$ across the sequence of replication experiments is a logical necessity for epistemic reproducibility. As an example, consider two scenarios 1 and 2. In each scenario, there are two experiments, the originals ($\xi_1$ and $\xi_2$, respectively) and their replications ($\xi'_1$ and $\xi'_2$, respectively). […] This closed scenario shows that if there is no openness in the sense of information flow from one experiment to the next, it is improbable (but still possible) for an experiment to reproduce the result of another experiment. In order to acknowledge this point, we say that a result can only be in-principle reproducible if there is no epistemic exchange between $\xi_1$ and $\xi'_1$ which could speak to the reproducibility of $R$, with the exception of via some omniscient external observer. At times, historians of science illustrate such examples of scientific discoveries independently arrived at by different scientists unaware of each other's work.
Open scenario: There is information flow from $\xi_2$ to $\xi'_2$, with respect to $R$ and other information relevant to obtaining $\hat{p}'$ in $\xi'_2$. If $\xi'_2$ incorporates this information, it is a replication. Here, $\xi'_2$ matches the elements of $\xi_2$ by social learning. The information necessary for learning is transmitted in $K$ and $R$. Starting with $\hat{p}$ as $R$, $\xi'_2$ could conclude that they have indeed reproduced it. Thus, in the open scenario, there is an epistemic interaction between $\xi_2$ and $\xi'_2$ which contributes to the progress of science through deliberate transfer of knowledge via social learning, which gives us the notion of epistemic reproducibility.
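The contrast between the closed and open scenarios can be made concrete with a conjugate Beta-Binomial update, in which a replication either starts from its own prior (closed) or from the posterior of the original experiment (open). This is a minimal sketch: the uniform Beta(1, 1) prior (prior mean 1/2), the use of the posterior mean as the reported value and the two black ravens observed in each experiment are assumptions chosen to mirror the raven illustration in this appendix, and the function names are hypothetical.

```python
from fractions import Fraction


def update(prior, black, white):
    """Conjugate update of a Beta(a, b) prior after observing raven colours."""
    a, b = prior
    return (a + black, b + white)


def posterior_mean(beta):
    """Mean of a Beta(a, b) distribution as an exact fraction."""
    a, b = beta
    return Fraction(a, a + b)


uniform_prior = (1, 1)                                         # prior mean 1/2

original = update(uniform_prior, black=2, white=0)
closed_replication = update(uniform_prior, black=2, white=0)   # no information flow
open_replication = update(original, black=2, white=0)          # starts from the original's posterior

print("original:          ", posterior_mean(original))
print("closed replication:", posterior_mean(closed_replication))
print("open replication:  ", posterior_mean(open_replication))
```

In the closed case both experiments arrive at the same value independently, while in the open case the replication's value depends on what was transmitted from the original.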
As an example, we show the difference between epistemic reproducibility and in-principle reproducibility in figure 5 with an infinite population of black and white ravens and Bayesian inference. Figure 5a illustrates the closed scenario: researchers of $\xi_1$ assume a prior view of 1/2 on $\hat{p}$. After observing $n = 2$ black ravens, they update their view to $\hat{p} = 3/4$ by Bayesian inference. Researchers of $\xi'_1$ assume a prior view of 1/2 on $\hat{p}$ and observe identical $D_v$, $n = 2$ black ravens as in $\xi_1$, and they update their view with the same $S_{post}$, to reach $\hat{p} = 3/4$. However, in the absence of an external observer privy to both experiments, these two results cannot be epistemically connected, and thus reproducibility is only in principle. Figure 5b illustrates the open scenario: researchers of $\xi_2$ assume a prior view of 1/2 on $\hat{p}$. After observing $n = 2$ black ravens, they update their view to $\hat{p} = 3/4$ by Bayesian inference. $\xi'_2$ is a proper replication experiment. It is informed by the result of $\xi_2$ as well as $K$, $M_A$, $S$ and $D_s$, and observes identical $D_v$ as $\xi_2$. Starting with a view of $\hat{p} = 3/4$ from $\xi_2$, researchers of $\xi'_2$ update their view to $\hat{p}' = 5/6$. Thus, $\xi'_2$ learns from $\xi_2$, here in a Bayesian manner. The two results can be connected and thus reproducibility is epistemic.

Appendix D
[…]
- If $S_{post}$ is the MME, the motivation is to set the population mean equal to the sample mean and solve for $p$, and we have $\hat{p}_{\mathrm{MME}} = b/n$.
- If $S_{post}$ is the posterior mode under the uniform prior (a special case of conjugate prior for $\xi_{bin}$), we have $\hat{p}$ […]
Therefore, $\xi$ can employ any one of these three estimators as $S_{post}$, and $\xi'$ can employ another as $S'_{post}$ and still reproduce $R$ by $R'$, as if they have used the same statistical method. For other modes of statistical inference such as hypothesis tests and prediction, we can find examples of numerically equivalent methods that are not identical in motivation (e.g. [53]).
[…]
We let $\phi_N^{(1)}, \phi_N^{(2)}, \ldots$ be estimates of reproducibility rates, with means $\phi^{(1)}, \phi^{(2)}, \ldots$ and variances $N^{-1}\phi^{(1)}(1 - \phi^{(1)}), N^{-1}\phi^{(2)}(1 - \phi^{(2)}), \ldots$, respectively. We assume that the series $\sum_{i=1}^{\infty} i^{-1}\phi^{(i)}(1 - \phi^{(i)})$ converges. Then, […] $\phi^{(i)}$, almost surely. (G 1)
Expression (G 1) states that the estimated reproducibility rate of results from non-exact replication experiments meaningfully converges to the mean true reproducibility rate of the idealized experiments performed. The case of exact replications given by equation (2.1) is a special case of equation (G 1), where all non-exact replications are identical to each other (and thus exact) with respect to the result obtained in an original idealized experiment. That is, if equation (G 1) is applied to $\xi \equiv \xi^{(1)} \equiv \xi^{(2)} \equiv \cdots \equiv \xi^{(N)}$, where the true reproducibility rate for $R_o$ obtained from $\xi$ is $\phi$, we obtain $N^{-1}$ […]

Appendix H
Reproducibility rate of R as a model selection problem, in the context of linear regression models.
In addition to the simulation example given in figure 3, here we present a second simulation example to illustrate the convergence of reproducibility rates from exact and non-exact replication experiments to their true value. Our example involves the model selection problem in the context of linear regression models. Briefly, we assume the linear regression model $y = X\beta + e$, where $y$ is an $n \times 1$ vector of responses, $X$ is an $n \times k$ matrix of fixed observables with first column entries equal to 1, $\beta$ is a $k \times 1$ vector of parameters and $e$ is an $n \times 1$ vector of independent and identically distributed normal errors with mean 0 and unknown variance. The statistical problem is as follows: given $D$ with $D_v$ independent and identically distributed and $D_s$ constituting $n \times 1$ responses and $n \times k$ observables, select the best linear regression model among three models with respect to a model selection criterion ($S_{post}$). The saturated model is given by […] where $x_1$, $x_2$ and $x_3$ are $n \times 1$ vectors of first, second and third predictors, respectively, and $\beta_1$, $\beta_2$ and $\beta_3$ are their respective regression coefficients. The set of three models considered in the model selection problem is as follows: […]
[…] figure 6 display runs of replication experiments indexed by colour. Each run converges to a point (indicated by star) representing the true reproducibility rate of a given run. As a whole, these plots illustrate how the true reproducibility rate changes depending on the elements of $\xi$ and the effect of divergence of $\xi'$ from $\xi$. We emphasize that all parameters of the simulation example in figure 6 are chosen so that one can discern the effect of varying models, methods and data structures. We interpret the results as follows:
1. The reproducibility rates for false results and for true results sum to 1, which is a verification of simulation experiments.
2. By the true rates of reproducibility marked by stars, we observe that they depend on the true data-generating mechanism and the elements of the original experiment, $S_{post}$ and $D_s$. For example, as the noise increases, the true reproducibility rate gets smaller, and the variance of the estimated reproducibility rate increases. So for larger noise, replication results are expected to be highly variable. True reproducibility rates of true results also change with sample size and method.
3. Reproducibility rate increases with the sample size for true results, whereas it decreases for false results such that low sample size makes false results more reproducible in our simulations.
4. Even when the true reproducibility rate is high, we might see a lot of variation in the observed reproducibility rate after a small number of replications, even when they are exact replications. Non-exact replications yield highly variable observed reproducibility rates that do not converge to the true reproducibility rate of the original result.
This simulation experiment complements the one presented in the main text (figure 3) by providing a different illustration from our toy example. The context of linear regression models is readily relevant to many practising scientists. Moreover, this simulation extends the results to new contexts by observing the outcome of interest under different levels of system noise and both true and false original results. Ultimately both simulations show considerable variability in true reproducibility rates as a function of the elements of and relationship between original and replication experiments.
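The logic of such a simulation can be sketched in a much reduced form. The sketch below is not the simulation reported in figure 6: it assumes only two candidate models (intercept-only versus intercept plus one predictor) instead of three, uses AIC as a stand-in for the model selection criterion $S_{post}$, fixes the design points, and picks arbitrary values for the slope, noise levels and sample size. A result counts as reproduced when an exact replication selects the same model as the original experiment.

```python
import math
import random


def aic_choice(x, y):
    """Return 1 if AIC prefers y = b0 + b1 * x over the intercept-only model, else 0."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    rss0 = sum((yi - mean_y) ** 2 for yi in y)   # intercept-only residual sum of squares
    rss1 = rss0 - sxy ** 2 / sxx                 # residual sum of squares with the slope
    aic0 = n * math.log(rss0 / n) + 2 * 2        # parameters: intercept, variance
    aic1 = n * math.log(rss1 / n) + 2 * 3        # parameters: intercept, slope, variance
    return 1 if aic1 < aic0 else 0


def reproducibility_rate(original_choice, slope, noise_sd, n,
                         n_replications=1000, seed=6):
    """Share of exact replications that select the same model as the original."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]          # fixed design shared by all replications
    same = 0
    for _ in range(n_replications):
        y = [1.0 + slope * xi + rng.gauss(0.0, noise_sd) for xi in x]
        same += aic_choice(x, y) == original_choice
    return same / n_replications


results = {}
for noise_sd in (0.5, 2.0):
    # The data are generated with a non-zero slope, so choice 1 is the true result.
    true_result_rate = reproducibility_rate(1, slope=1.0, noise_sd=noise_sd, n=30)
    false_result_rate = reproducibility_rate(0, slope=1.0, noise_sd=noise_sd, n=30)
    results[noise_sd] = (true_result_rate, false_result_rate)

for noise_sd, (true_rate, false_rate) in results.items():
    print(f"noise sd = {noise_sd}: true result {true_rate:.3f}, false result {false_rate:.3f}")
```

Because both rates are computed on the same simulated replications, they sum to 1, and in this sketch the rate for the true result drops as the noise grows.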

Appendix I
True results are not necessarily reproducible and perfectly reproducible results may not be true.
Reproducibility is a function of the true unknown data-generating model and the elements of $\xi$. Devezer et al. [10] provide some account. We give a brief overview with a proof by counterexample. Conditional on $R$ from $\xi$, we let $\xi^{(1)}, \xi^{(2)}, \ldots$ be exact replications of $\xi$ and $I_{\{b^*\}}$ be the indicator function that equals 1 if the first raven in the sample is black, and 0 otherwise. To prove the first part of the statement, we choose the estimator $\hat{p}$ […] The estimator $\hat{p}$ is valid on $[0, 1]$: if $b = n$, then the first raven sampled must be black and $\hat{p} = 1$; if $b = 0$, then the first raven must be white and $\hat{p} = 0$, such that $\hat{p} \in [0, 1]$. However, $\hat{p}$ is unbiased for $p$ only with probability $(1 - p)$. The reason is that the probability that the first raven is white is $(1 - p)$, and if it is a white raven, we get $\hat{p} = b/n$, giving $E(\hat{p}) = E(b/n) = (1/n)(np) = p$. In contrast, $\hat{p}$ is biased for $p$ with probability $p$. The reason is that the probability that the first raven is black is $p$, and if it is a black raven, we obtain $E(\hat{p}) \neq p$. This not only shows that true results are not always reproducible but also shows that the reproducibility rate can be a function of the true parameter.
To prove the second part of the statement, choose the estimator $\hat{p} = c$, where $c$ is a constant in $[0, 1]$. Then $E(\hat{p}) = c$. This expectation is only equal to $p$ when $p = c$. However, the result using this $\hat{p}$ is reproducible with probability 1, thereby completing the proof.
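The second part of the proof can be illustrated numerically. In this sketch the true proportion of black ravens (0.3), the constant $c = 0.9$, the sample size and the original sample are all hypothetical values, and a result is counted as reproduced when an exact replication reports the same estimate as the original experiment; that criterion is an assumption made for the illustration.

```python
import random


def constant_estimator(sample, c=0.9):
    """Ignore the data and always report the constant c."""
    return c


def sample_proportion(sample):
    """Proportion of black ravens (True values) in the sample."""
    return sum(sample) / len(sample)


def reproducibility_rate(estimator, original_sample, p_true, n,
                         n_replications=2000, seed=7):
    """Share of exact replications whose estimate equals the original estimate."""
    rng = random.Random(seed)
    original_result = estimator(original_sample)
    same = 0
    for _ in range(n_replications):
        sample = [rng.random() < p_true for _ in range(n)]   # True means a black raven
        same += estimator(sample) == original_result
    return same / n_replications


p_true, n = 0.3, 10
# Hypothetical original sample with b = 3 black ravens out of n = 10.
original_sample = [True, False, False, True, False, False, False, True, False, False]

rate_constant = reproducibility_rate(constant_estimator, original_sample, p_true, n)
rate_proportion = reproducibility_rate(sample_proportion, original_sample, p_true, n)

print(f"constant estimator (reports 0.9, truth is {p_true}): rate = {rate_constant:.3f}")
print(f"sample proportion  (reports 0.3, truth is {p_true}): rate = {rate_proportion:.3f}")
```

Here the false result obtained from the constant estimator is reproduced in every replication, while the result that happens to equal the true value is reproduced only in a minority of them.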